Data Collection

Collecting and representing OJA data.

There are several data sources for online job ad data. The most common is web scraping of employers' websites, job boards, or aggregation portals. Some job or aggregation portals offer APIs to access the data directly.

How the data is collected, stored, and processed directly influences its quality, representativeness, and usefulness. A solid understanding of these processes is therefore crucial for interpreting any results down the line.

Data Sources and OJA Landscaping

Where online job ads are published is worth investigating in its own right, as the online job matching market is growing and changing quickly. Cedefop and Eurostat carry out country-specific OJA market analyses, which aim to identify the essential job portals needed for representative coverage and to provide context for interpreting country-level data.

Link to the blogpost on OJA landscaping by Jiri Branka

Web Scraping

After determining where OJAs can be found, the next question is how the data can be collected. The most common technique for this purpose is web scraping, which refers to the process of extracting data from websites.

In the context of online job ad analysis, this means that individual job postings can be targeted and downloaded automatically. On many websites, scraping OJAs is made easier by specific data schemas embedded in the pages, which, for example, help aggregator platforms pick up the job ads. While scraping usually refers to downloading job ads from known locations, some broader methods aim at discovering relevant job ads in the first place.
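As a rough illustration, the sketch below downloads a single job ad page and looks for embedded schema.org JobPosting markup (discussed in more detail further below). The URL, the user agent string, and the assumption that the posting is exposed as JSON-LD are hypothetical; real portals differ and usually need their own parsing logic.

import json

import requests
from bs4 import BeautifulSoup


def fetch_job_posting(url: str) -> dict:
    """Download a job ad page and try to extract embedded schema.org JobPosting data."""
    response = requests.get(
        url, headers={"User-Agent": "oja-research-bot/0.1"}, timeout=30
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Many portals embed the posting as JSON-LD in a <script> tag.
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "JobPosting":
            return data

    # Fallback: no structured data found, keep the raw text for later parsing.
    return {
        "title": soup.title.string if soup.title else None,
        "raw_text": soup.get_text(" ", strip=True),
    }


if __name__ == "__main__":
    posting = fetch_job_posting("https://example.com/jobs/12345")  # hypothetical URL
    print(posting.get("title"))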

This process is often called spidering or crawling. Here is how TextKernel, an OJA data provider, describes spidering in the context of its Jobfeed product:

Jobfeed obtains new jobs from the Internet daily through spidering. In order to achieve broad and deep coverage, Jobfeed uses two spider methods: wild spidering and targeted spidering. The wild spider is a system that works automatically and dynamically. It continuously indexes hundreds of thousands of relevant (company) websites and discovers new job postings.

Targeted spider scripts are created to retrieve jobs from specific – usually large – websites, like job boards, and websites of large employers. Despite their size and complexity, the script ensures that all jobs are found. The targeted spider scripts run multiple times per day.

Textkernel: How Jobfeed by TextKernel works

Automatically crawling websites for relevant job postings might be more efficient but also comes with trade-offs in terms of data quality. Here is Cedefop describing the trade-off:

Crawling uses a programmed robot to browse web portals systematically and download their pages. Crawling is much more general compared to scraping and is easier to develop. However, crawlers collect much more website noise (irrelevant content) and more effort is needed to clean the data before further processing.

Cedefop, Online job vacancies and skills analysis: a Cedefop pan-European approach, Publications Office, 2019

It is important to ensure that the data is collected and used ethically and legally. As web scraping sometimes falls into a legal "grey area", it is crucial that the collected data is used in a way that respects the privacy of individuals and complies with data protection laws. In the German context, the Science Center Berlin commissioned a report from the University of Würzburg on web scraping in independent scientific research in 2018. It concludes that, from a legal perspective, scientists can use web scraping to support their non-commercial research, but they must meet certain requirements to comply with the law.

In addition to data privacy and data protection considerations, some websites prohibit or limit web scraping, either in their terms of service or via a robots.txt file.
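As a small example, Python's standard library can check a site's robots.txt before a page is fetched; the domain, path, and user agent below are hypothetical.

from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt (hypothetical job board).
robots = RobotFileParser()
robots.set_url("https://example-jobboard.com/robots.txt")
robots.read()

target = "https://example-jobboard.com/jobs/12345"
if robots.can_fetch("oja-research-bot", target):
    print("Allowed to fetch", target)
else:
    print("robots.txt disallows fetching", target)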

An often overlooked challenge in web scraping is maintainability. While scraping might be suitable for a one-off analysis of a limited sample, maintaining scrapers, crawlers, and spiders over a longer period requires considerable resources: websites change their schemas and URLs, and data quality must be monitored closely. Web crawling might seem cheaper than buying data from data providers, but the costs and resources needed for crawling at scale should not be underestimated.

API Data Collection / Data Providers

The main challenges of web scraping can sometimes be mitigated by using APIs offered by OJA data providers. APIs (Application Programming Interfaces) allow different software systems to communicate and exchange data; data providers use them to give developers controlled, programmatic access to their systems. The data accessible via such APIs is generally of high quality, as it is either generated through a controlled web scraping process (with data quality checks) or comes directly from the structured databases of job portals.

Some API providers also enrich their data to increase its value by attaching standardized job titles, extracting skills or providing data on the employer.

Examples of job data APIs are the TextKernel JobFeed API and the Lightcast Burning Glass API.

Some data providers may only provide metadata rather than the full text of postings. For your project or product, carefully consider your requirements for data quality, whether you need to develop and run your own enrichment algorithms, and the amount of data you need for analysis.

Using APIs usually requires authentication through an API key or access credentials, which in turn require a subscription to the service. Depending on the offering, APIs may limit the number and frequency of requests and impose other terms of use that must be followed.
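The sketch below shows what a typical authenticated API call might look like; the endpoint, parameters, and response structure are illustrative assumptions rather than the interface of any specific provider.

import time

import requests

API_KEY = "your-api-key"  # obtained with a subscription
BASE_URL = "https://api.example-provider.com/v1/postings"  # hypothetical endpoint


def fetch_page(page: int) -> dict:
    """Fetch one page of job postings, respecting the provider's rate limits."""
    while True:
        response = requests.get(
            BASE_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"country": "DE", "page": page},
            timeout=30,
        )
        # Back off and retry if the rate limit is hit (HTTP 429).
        if response.status_code == 429:
            time.sleep(int(response.headers.get("Retry-After", "60")))
            continue
        response.raise_for_status()
        return response.json()


postings = fetch_page(1).get("results", [])
print(f"Retrieved {len(postings)} postings")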


Job Posting Data Schema

When collecting and storing OJA data, it is important to consider how job postings are represented as data and how they can be stored in a structured way.

Job postings typically have a set of properties and metadata such as title, description (full text), occupation, employment type, location, and so on. Every portal or website has its own way of structuring this information, so there is no single unified schema for representing OJA data.

The closest to a universal specification of a job posting data schema is the JobPosting type provided by schema.org. schema.org is a collaborative community activity that aims to create, maintain, and promote unified data schemas for different kinds of data.

Here is an example of how a job posting might be marked up using the JobPosting type:

<div itemscope itemtype="http://schema.org/JobPosting">
  <h2 itemprop="title">Software Engineer</h2>
  <span itemprop="description">
    We are seeking a highly skilled software engineer to join our team. 
    The successful candidate will have experience building and 
    maintaining software applications, as well as a strong understanding
    of computer science principles.
  </span>
  <div itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
    <span itemprop="name">ACME Corp</span>
  </div>
  <div itemprop="employmentType" itemscope itemtype="http://schema.org/EmploymentType">
    <span itemprop="name">Full-time</span>
  </div>
  <div itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
    <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="addressLocality">San Francisco</span>,
      <span itemprop="addressRegion">CA</span>
    </span>
  </div>
  <div itemprop="baseSalary" itemscope itemtype="http://schema.org/MonetaryAmount">
    <span itemprop="currency">USD</span>
    <span itemprop="value" itemscope itemtype="http://schema.org/PriceSpecification">
      <span itemprop="minValue">100000</span>
      <span itemprop="maxValue">120000</span>
      <span itemprop="unitText">YEAR</span>
    </span>
  </div>
  <div itemprop="validThrough" content="2022-01-01">December 1, 2023</div>
</div>

By using the JobPosting type and including these properties, OJA providers help search engines and other crawlers or scrapers interpret OJAs in a structured way. Google, for example, documents this schema so that postings integrate better with its job search indexing service.

When designing your own class for representing job posting data, it is usually a good idea to use the schema.org definition directly, or something close to it, to ensure interoperability (a small sketch follows the example below). Here is an example of a posting in the JSON format developed by the University of Chicago, which also draws on the schema.org specification:

{
    "incentiveCompensation": "",
    "experienceRequirements": "Here are some experience and requirements",
    "baseSalary": {"maxValue": 0.0, "@type": "MonetaryAmount", "minValue": 0.0},
    "description": "We are looking for a person to fill this job",
    "title": "Bilingual (Italian) Customer Service Rep (Work from Home)",
    "employmentType": "Full-Time",
    "industry": "Call Center / SSO / BPO, Consulting, Sales - Marketing",
    "occupationalCategory": "",
    "qualifications": "Here are some qualifications",
    "educationRequirements": "Not Specified",
    "skills": "Customer Service, Consultant, Entry Level",
    "validThrough": "2014-02-05T00:00:00",
    "jobLocation": {"@type": "Place", "address": {"addressLocality": "Salisbury", "addressRegion": "PA", "@type": "PostalAddress"}},
    "@context": "http://schema.org",
    "alternateName": "Customer Service Representative",
    "datePosted": "2013-03-07",
    "@type": "JobPosting"
}
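One way to keep an internal representation close to schema.org is a small data class whose fields mirror JobPosting properties. The sketch below covers only a subset of properties, and the choice of fields is an assumption about what a project might need, not a complete mapping.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobPosting:
    """Internal representation aligned with schema.org JobPosting field names."""
    title: str
    description: str
    date_posted: str                      # ISO 8601 date, cf. schema.org datePosted
    valid_through: Optional[str] = None   # cf. schema.org validThrough
    hiring_organization: Optional[str] = None
    employment_type: Optional[str] = None
    address_locality: Optional[str] = None
    address_region: Optional[str] = None
    skills: list[str] = field(default_factory=list)

    @classmethod
    def from_schema_org(cls, data: dict) -> "JobPosting":
        """Build a posting from a schema.org JobPosting dictionary."""
        location = (data.get("jobLocation") or {}).get("address", {})
        org = data.get("hiringOrganization")
        return cls(
            title=data.get("title", ""),
            description=data.get("description", ""),
            date_posted=data.get("datePosted", ""),
            valid_through=data.get("validThrough"),
            hiring_organization=org.get("name") if isinstance(org, dict) else org,
            employment_type=data.get("employmentType"),
            address_locality=location.get("addressLocality"),
            address_region=location.get("addressRegion"),
            # Some portals provide skills as a comma-separated string, as above.
            skills=[s.strip() for s in (data.get("skills") or "").split(",") if s.strip()],
        )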

Data Storage

As discussed above, storing OJA data comes with its own challenges. In general, it is advisable to store large amounts of data in a database. There are three major types of databases, each with advantages and disadvantages when dealing with OJA data:

Relational Databases: Relational databases (SQL databases) are well-established and widely used. Since the data is structured in clearly defined schemas, even very complex queries are easy to express. On the other hand, the use cases should be well defined before setting up the database, as any change in requirements will require changes to its schema (see the sketch after this overview).

Document Databases: Often also called NoSQL databases (MongoDB, for example), these were designed for semi-structured or unstructured data. Document databases are flexible and scalable by nature and allow OJAs to be stored just as they are collected. Compared to SQL databases, however, they are less efficient for complex querying, and data quality must be monitored more closely.

Graph Databases: OJAs can be represented as graphs, since postings usually share many properties (organizations, skills, occupations, etc.). Where a relational database would link these relations across tables, a graph database represents OJAs directly as a network of nodes and edges. This makes very complex queries possible and provides a high level of data consistency and integrity. The usefulness of graph representations in the OJA context is exemplified by the ontologies provided by the European Union (such as ESCO and ISCO), which are published as linked open data.

Which database type to choose depends to a great extent on the specific requirements and use cases. While the "natural" environment for semi-structured documents like OJAs is document databases, the other two types usually provide a better interface for large-scale, structured analysis.
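As a minimal illustration of the relational approach, the sketch below creates a postings table and a separate skills table linked by the posting identifier, using Python's built-in sqlite3 module; the table and column names are assumptions for illustration.

import sqlite3

conn = sqlite3.connect("oja.db")

# One row per posting, plus a separate table for the 1:n skill relation.
conn.executescript("""
CREATE TABLE IF NOT EXISTS postings (
    posting_id   TEXT PRIMARY KEY,
    title        TEXT,
    description  TEXT,
    date_posted  TEXT,
    source_site  TEXT
);
CREATE TABLE IF NOT EXISTS posting_skills (
    posting_id   TEXT REFERENCES postings(posting_id),
    skill        TEXT
);
""")

conn.execute(
    "INSERT OR REPLACE INTO postings VALUES (?, ?, ?, ?, ?)",
    ("job-001", "Software Engineer", "We are seeking ...", "2023-03-07", "example.com"),
)
conn.executemany(
    "INSERT INTO posting_skills VALUES (?, ?)",
    [("job-001", "Python"), ("job-001", "SQL")],
)
conn.commit()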

Storing OJA data in a tabular format such as .csv or .xlsx is challenging as most properties are nested by nature. For example, one job posting might have multiple related skills or occupations. A workaround might be to store information in separate tables that can be joined using a common key. In practice, this could look as follows:

  • a meta table containing rows with job postings. Each posting is identified using a unique identifier. This table contains information that has clear 1:1 relationships with the posting (when it was published, on which website it was published, etc.)

  • a skills table containing rows with skills. Each skill is mapped to a posting id from the meta table.

This approach, however, does not scale well and is memory- and processing-intensive for large datasets.
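For smaller datasets, this two-table workaround is straightforward to work with; the sketch below illustrates the join with pandas, with column names and values made up for illustration.

import pandas as pd

# Meta table: one row per posting, identified by a unique key.
meta = pd.DataFrame({
    "posting_id": ["job-001", "job-002"],
    "date_posted": ["2023-03-07", "2023-03-08"],
    "source_site": ["example.com", "example-jobboard.com"],
})

# Skills table: one row per (posting, skill) pair.
skills = pd.DataFrame({
    "posting_id": ["job-001", "job-001", "job-002"],
    "skill": ["Python", "SQL", "Customer Service"],
})

# Join the skills back onto the postings via the shared key.
merged = skills.merge(meta, on="posting_id", how="left")
print(merged)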
