Dataset Curation and Representativity Analysis
After collecting, structuring, and enriching OJA data, you usually want to create a sample for analysis. At this stage, there are still duplicates in the dataset, and depending on your data collection strategy, there might be structural bias.
We have already dealt with duplicate detection as part of the data enrichment phase. However, removing duplicates for analysis warrants its own section, as there are trade-offs to be aware of.
Deduplicating online job ads (after a duplicate identification algorithm has marked them) can be done in several ways, depending on the specific requirements of your project:
Automatic removal: One approach is to remove all duplicates identified by the algorithm automatically. This is done by deleting duplicate records based on some property, usually the publishing date, keeping only the earliest record of each job. Alternatively, one instance can be picked at random from all duplicates. Automatic removal can lead to the loss of important information: if job postings are collected from different job portals, the parsing might work better on one platform than on another, and you would lose indicators that can be extracted from one but not the other.
Custom rules: Another approach is to set more intricate rules for deduplication manually. For example, if a job posting for a vacancy is available from a website with high data quality, take that one; otherwise, take the posting from another portal. This approach, however, might introduce another kind of bias into the dataset, which you should analyze and be aware of.
Merging: A third approach is to merge the duplicates into a single record, typically by starting with the earliest job posting and then merging in any additional information or indicators available in the other instances. This can retain all relevant information but can also introduce inaccuracies when different postings contain conflicting information. Furthermore, it is computationally the most expensive form of deduplication.
When deduplicating job ads, there are trade-offs to consider. Automatic removal is the quickest and easiest approach but can discard important information. Merging retains all relevant information but risks inaccuracies from conflicting postings and is the most expensive to compute. Ultimately, the right choice depends on the project's goal and on how the indicators are distributed across sources. The sketch below illustrates all three strategies.
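To make the options concrete, here is a minimal pandas sketch of the three strategies. The column names (`dup_group`, `portal`, `published_at`) and the portal preference are illustrative assumptions, not part of any specific pipeline.

```python
import pandas as pd

# Hypothetical postings where a duplicate-identification step has already
# assigned each ad a duplicate-group id (`dup_group`).
ads = pd.DataFrame({
    "dup_group":    [1, 1, 2],
    "portal":       ["portal_a", "portal_b", "portal_a"],
    "published_at": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-03"]),
    "salary":       [None, "45000-55000", "38000"],
    "contract":     ["full-time", None, "part-time"],
})

# 1) Automatic removal: keep only the earliest posting per duplicate group.
earliest = (ads.sort_values("published_at")
               .drop_duplicates("dup_group", keep="first"))

# 2) Custom rule: prefer the portal assumed to have better parsing quality
#    (portal_b here, purely as an example), earliest posting as tie-breaker.
priority = {"portal_b": 0, "portal_a": 1}
by_rule = (ads.assign(rank=ads["portal"].map(priority))
              .sort_values(["dup_group", "rank", "published_at"])
              .drop_duplicates("dup_group", keep="first")
              .drop(columns="rank"))

# 3) Merging: take the first non-null value per column within each group,
#    ordered by publishing date, so the earliest posting wins on conflicts.
merged = (ads.sort_values("published_at")
             .groupby("dup_group", as_index=False)
             .first())
```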
When creating a dataset for analysis, another thing to consider is whether to exclude certain job ads, for example, postings published by temporary work or recruitment agencies. These agencies often search broadly for candidates who can be staffed to fill vacancies at different companies. Other job postings that could be excluded from the analysis are internship, volunteer, or freelancer positions.
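One simple way to implement such exclusions is to filter on properties extracted during enrichment. The sketch below assumes hypothetical `company_name` and `employment_type` columns and an illustrative keyword list.

```python
import pandas as pd

# Hypothetical columns produced during the enrichment phase.
ads = pd.DataFrame({
    "company_name":    ["ACME GmbH", "StaffPro Recruitment", "Beta AG"],
    "employment_type": ["full-time", "full-time", "internship"],
})

# Flag agency postings by keyword match on the company name (illustrative).
agency_pattern = "recruitment|staffing|temporary work"
is_agency = ads["company_name"].str.contains(agency_pattern, case=False, na=False)

# Flag employment types that should not enter the analysis.
is_excluded_type = ads["employment_type"].isin(["internship", "volunteer", "freelancer"])

analysis_set = ads[~is_agency & ~is_excluded_type]
```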
Before analyzing trends or making comparisons, OJA data is often aggregated. Here are some of the aggregations usually made (a code sketch follows the list):
Aggregation by region: The NUTS regions are typical aggregation levels for regional analysis. The Nomenclature of Territorial Units for Statistics (NUTS) is a hierarchical classification that divides the territory of the European Union into socio-economic regions at different levels. The hierarchy allows data to be aggregated across regions - for example, in Germany, level 1 corresponds to the states (Bundesländer) and level 3 to the districts (Landkreise). A prerequisite for this aggregation is usually that the location or geo-location of the vacancy is provided in the metadata.
Aggregation by time: OJAs can be grouped by day, week, month, quarter, or year based on the publishing date of the job ad. Analyzing daily new postings can be challenging due to high volatility and stark day-to-day variance; grouping data into larger intervals, such as months or quarters, usually makes the analysis more robust. Another way to make time-series analysis more robust is to look not at new postings but at the stock of active vacancies/postings. In practice, however, this requires both the publishing date and the date when the posting was retired or the vacancy filled, which often is not available. The Statistical Office used a "" method to estimate the current stock of active job ads ().
Aggregation by indicators: Lastly, data can be grouped by any extracted property. The most common are industry (e.g., Klassifikation der Wirtschaftszweige 2008), occupation (e.g., Klassifikation der Berufe 2010), skills (e.g., ESCO), education level (e.g., ISCED), salary range, or employment type. For more taxonomies, see .
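As a rough illustration of these aggregations, the following sketch counts monthly new postings per NUTS-1 region, exploiting the fact that NUTS codes nest (a level-1 code is a prefix of the level-3 code), and then counts postings per occupation. All column names and codes are illustrative.

```python
import pandas as pd

# Hypothetical enriched postings: NUTS-3 code, publishing date, occupation code.
ads = pd.DataFrame({
    "nuts_3":       ["DE111", "DE112", "DE212", "DE212"],
    "published_at": pd.to_datetime(["2024-01-03", "2024-01-20",
                                    "2024-02-01", "2024-02-14"]),
    "occupation":   ["occ_a", "occ_b", "occ_a", "occ_c"],  # illustrative codes
})

# Regional aggregation: higher NUTS levels are prefixes of lower ones
# (e.g., DE1 at level 1 contains DE111 at level 3).
ads["nuts_1"] = ads["nuts_3"].str[:3]

# Temporal aggregation: monthly counts are usually more robust than daily ones.
ads["month"] = ads["published_at"].dt.to_period("M")

monthly_by_region = ads.groupby(["nuts_1", "month"]).size().rename("new_postings")

# Indicator aggregation: counts per occupation code.
by_occupation = ads["occupation"].value_counts()
```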
Representativity analysis closes the circle back to the beginning of the OJA cycle: the question of how and where job postings are published in the first place and, consequently, whether OJAs are representative of the labor market at large.
Here is an example of the checks that Cedefop proposes in their 2022 paper:
"Two external data sources, the Labour force survey (LFS) and the Job vacancies survey (JVS), were used to evaluate the selectivity of data in terms of several criteria (e.g. comparisons on sectoral, occupational and geographic levels). The comparison of occupations listed in the European skills, competences and occupations classification (ESCO) ( 3 ) taxonomy was used to identify occupations on the labour market absent from OJAs." ()