Dataset Curation and Representativity Analysis

After collecting, structuring, and enriching OJA data, you usually want to create a sample for analysis. At this stage, there are still duplicates in the dataset, and depending on your data collection strategy, there might be structural bias.


Removing Duplicates

We have already dealt with Identifying Duplicates as part of the data enrichment phase. However, removing duplicates for analysis warrants its own section, as there are trade-offs to be aware of.

Deduplicating online job ads (after a duplicate identification algorithm has marked duplicates) can be done in several ways, depending on the specific requirements of your project:

  1. Automatic removal: One approach is to automatically remove all duplicates the algorithm has identified, keeping a single record per vacancy based on some property. Usually this is the publishing date, in which case you keep only the earliest record of a given job; alternatively, one posting can be picked at random from all instances. Automatic removal can lead to the loss of important information: if job postings are collected from different job portals, parsing might work better on one platform than on another, and you would lose indicators that can be extracted from one source but not the other.

  2. Custom rules: Another approach is manually setting more intricate rules for deduplication. For example, if a job posting for a vacancy is available from one website with high data quality, take this one; otherwise, take the job posting from another portal. This approach, however, might introduce another kind of bias to the dataset, which you should analyze and be aware of.

  3. Merging: A third approach is to merge the duplicates into a single record, typically by starting with the earliest job posting and then merging any additional information or indicators available in other instances. This can help retain all relevant information but can also introduce inaccuracies when different postings contain conflicting information. Furthermore, this is computationally the most expensive form of deduplication.
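As a minimal sketch, strategies 1 and 3 could look like this in pandas. The column names and the `dup_group` marker are illustrative, not a fixed schema; they assume a duplicate-identification step has already flagged postings of the same vacancy:

```python
import pandas as pd

# Illustrative data: 'dup_group' marks ads that a duplicate-identification
# algorithm has flagged as the same underlying vacancy (column names are made up).
ads = pd.DataFrame({
    "ad_id": [1, 2, 3, 4],
    "dup_group": ["a", "a", "b", "b"],
    "published": pd.to_datetime(["2023-01-05", "2023-01-02",
                                 "2023-02-01", "2023-02-03"]),
    "salary": [None, 50000, 42000, None],
})

# Strategy 1, automatic removal: keep only the earliest posting per group.
earliest = (ads.sort_values("published")
               .drop_duplicates("dup_group", keep="first"))

# Strategy 3, merging: start from the earliest posting and fill missing
# indicators (here: salary) from later duplicates; the 'first' aggregation
# takes the first non-null value in publishing order.
merged = (ads.sort_values("published")
             .groupby("dup_group")
             .agg({"ad_id": "first", "published": "first", "salary": "first"}))
```

Note how the merged record for each group keeps the earliest ad's identity but picks up the salary from whichever duplicate actually contained it.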

When deduplicating job ads, there are several trade-offs to consider. Automatic removal is the quickest and easiest approach but can lead to losing important information. Merging the duplicates retains all relevant information but can introduce inaccuracies when instances conflict, and it is the most expensive to compute. Ultimately, the right choice depends on the project's goal and on how the indicators are distributed across duplicates.

Automatic removal is usually a good strategy: provided there is no bias in the property you deduplicate on, it removes instances effectively at random, so it does not change the distributions of the indicators and does not introduce new bias into the data.

Filtering Data

When creating a dataset for analysis, one thing to consider is whether to remove certain job ads from the analysis, for example postings published by temporary work or recruitment agencies. These agencies often search broadly for candidates who can be staffed to fill vacancies at different companies. Other job postings that could be excluded from the analysis are internship, volunteer, or freelancer positions.
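A filtering step of this kind might look as follows, assuming hypothetical `publisher_type` and `position_type` columns were produced during enrichment (the category labels are invented for illustration):

```python
import pandas as pd

# Illustrative data with hypothetical classification columns.
ads = pd.DataFrame({
    "ad_id": [1, 2, 3, 4, 5],
    "publisher_type": ["employer", "recruitment_agency", "employer",
                       "temp_agency", "employer"],
    "position_type": ["regular", "regular", "internship",
                      "regular", "freelance"],
})

# Exclude postings from intermediaries and non-standard positions.
EXCLUDED_PUBLISHERS = {"recruitment_agency", "temp_agency"}
EXCLUDED_POSITIONS = {"internship", "volunteer", "freelance"}

analysis_sample = ads[
    ~ads["publisher_type"].isin(EXCLUDED_PUBLISHERS)
    & ~ads["position_type"].isin(EXCLUDED_POSITIONS)
]
```

Keeping the exclusion sets as explicit constants makes the filtering decisions easy to document and revisit when assessing bias.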

Aggregation and Unit of Observation

Before analyzing trends or making comparisons, OJA data is often aggregated. Here are some of the aggregations usually made:

  • Aggregation by region: The NUTS regions are typical aggregation levels for regional analysis. The Nomenclature of Territorial Units for Statistics (NUTS) of the European Union is a hierarchical classification that divides the EU into socio-economic regions at different levels. This hierarchical arrangement allows data to be aggregated across regions - in Germany, for example, level 1 corresponds to the states (Bundesländer) and level 3 to the districts (Landkreise). A prerequisite for this aggregation is usually that the location or geo-location of the vacancy is provided in the metadata.

  • Aggregation by time: OJAs can be grouped by day, week, month, quarter, or year based on the information about when a job ad was published. Analyzing daily new postings can be challenging because of high volatility and stark inter-day variance; grouping data into bigger time intervals, such as months or quarters, usually makes the analysis more robust. Another way of making time-series analysis more robust is to look not at new postings but at the stock of active vacancies/postings. In practice, however, this requires both the publishing date and the date the posting was retired or the vacancy filled, which often is not available. The German Federal Statistical Office used a "pseudo-stock" method to estimate the current stock of active job ads (Lazzar and Rengers 2021).

  • Aggregation by indicators: Lastly, data can be grouped on any extracted property. The most common are industry (e.g., Klassifikation der Wirtschaftszweige 2008), occupation (e.g., Klassifikation der Berufe 2010), skills (e.g., ESCO), education level (e.g., ISCED), salary ranges, or employment type. For more taxonomies, see #taxonomies-for-online-job-ad-analysis.
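A sketch of aggregation by time and region in pandas, assuming the publishing date and a NUTS-1 code are available as columns (names and codes illustrative):

```python
import pandas as pd

# Illustrative postings with a publishing date and a NUTS-1 region code.
ads = pd.DataFrame({
    "published": pd.to_datetime(["2023-01-03", "2023-01-20",
                                 "2023-02-07", "2023-02-11"]),
    "nuts1": ["DE2", "DE2", "DE2", "DE7"],  # e.g. Bavaria, Hesse
})

# Aggregate by month and region: count of new postings per cell.
monthly = (ads.groupby([ads["published"].dt.to_period("M"), "nuts1"])
              .size()
              .rename("new_postings")
              .reset_index())
```

The same `groupby` pattern extends to any extracted indicator, such as occupation or industry codes, by adding further key columns.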

The level of aggregation is an important choice for analysis. Especially when making comparisons, it is essential to remember that the more combinations of indicators you analyze, the smaller the number of observations in each group becomes. For example, even when you start with 10 million online job vacancies per year, grouping them by month (12 categories), district in Germany (400 categories), and occupation (~1,250 categories) already yields 6 million groups - fewer than two observations per group on average - which makes some statistical analyses meaningless.
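The arithmetic behind this warning is easy to check:

```python
# Back-of-the-envelope check: average observations per cell when grouping
# 10 million ads by month, German district, and occupation.
n_ads = 10_000_000
n_groups = 12 * 400 * 1_250   # months x districts x occupations = 6,000,000 cells
avg_per_cell = n_ads / n_groups  # fewer than 2 ads per cell on average
```

Even before accounting for the skewed real-world distribution of ads across regions and occupations, the average cell is too small for meaningful statistics.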

Representativity Analysis

Representativity analysis closes the circle back to the beginning of the OJA cycle: the question of how and where job postings are published in the first place - and, consequently, whether OJAs are representative of the labor market at large.

Here is an example of the checks that Cedefop proposes in their 2022 paper:

"Two external data sources, the Labour force survey (LFS) and the Job vacancies survey (JVS), were used to evaluate the selectivity of data in terms of several criteria (e.g. comparisons on sectoral, occupational and geographic levels). The comparison of occupations listed in the European skills, competences and occupations classification (ESCO) ( 3 ) taxonomy was used to identify occupations on the labour market absent from OJAs." (Cedefop 2022)

The type of representativity analysis needed depends mainly on the inference you want to draw. When testing hypotheses about the labor market at large, one must be more careful than when testing hypotheses about the online job ad space itself, for example.
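As an illustration of such a selectivity check, occupational shares in the OJA data can be compared against an external survey such as the LFS. The numbers below are invented and the ratio threshold is a simplification of the Cedefop-style comparison:

```python
import pandas as pd

# Hypothetical occupational shares: OJA data vs. an external survey (e.g. LFS).
shares = pd.DataFrame({
    "occupation": ["software_dev", "nurse", "construction"],
    "oja_share": [0.30, 0.10, 0.05],
    "lfs_share": [0.05, 0.12, 0.10],
}).set_index("occupation")

# Over-/under-representation ratio per occupation: a ratio above 1 means the
# occupation appears more often in online job ads than in the survey.
shares["ratio"] = shares["oja_share"] / shares["lfs_share"]
over_represented = shares[shares["ratio"] > 1].index.tolist()
```

Occupations entirely absent from the OJA data (ratio of zero against the taxonomy) would signal parts of the labor market that online job ads do not cover.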