Data Enrichment

Enriching OJA data by extracting relevant concepts from the collected data.

After collecting OJA data from primary or secondary sources, the data often needs to be enriched before analysis. Only in very few cases are the indicators and properties we are interested in already available in a structured form.

This page is closely related to the Extraction Methods section but focuses more on specific indicators and properties of OJAs.

The term "enrichment" was chosen for this section as the goal of each of the following steps is to add more structure or new information to an OJA. Duplicate identification is a good example of this framing: The identification of duplicates usually involves adding a separate indicator to a posting and linking it to other postings, rather than directly removing them from a dataset. Dealing with duplicates (e.g. removing them for analysis) then constitutes another process that is related to Dataset Curation and Representativity Analysis.

Text Segmentation

Text segmentation (or "zoning") refers to identifying and classifying paragraphs or text segments in OJAs. Among others, the following segments can be found in many OJAs:

  • description of the company/employer

  • description of the job (e.g., tasks)

  • required candidate profile

  • description of benefits

  • contact fields

Segmenting the full text of an OJA can have multiple advantages for downstream extraction or analysis tasks. One example is that segmenting text can drastically increase the performance of rule-based extraction methods.
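As a minimal illustration, a keyword-based zoner can serve as a simple baseline for segmentation. The zone names and trigger phrases below are illustrative assumptions, not an established scheme:

```python
# Minimal keyword-based zoning sketch. Zone names and trigger phrases
# are illustrative assumptions, not an established taxonomy.
ZONE_KEYWORDS = {
    "company_description": ["about us", "our company", "who we are"],
    "job_description": ["your tasks", "responsibilities", "your role"],
    "candidate_profile": ["your profile", "requirements", "qualifications"],
    "benefits": ["we offer", "benefits", "perks"],
    "contact": ["contact", "apply", "e-mail"],
}

def classify_paragraph(paragraph: str) -> str:
    """Assign a paragraph to the first zone whose keywords it contains."""
    text = paragraph.lower()
    for zone, keywords in ZONE_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return zone
    return "other"

ad = [
    "About us: We are a fast-growing logistics company.",
    "Your tasks: Maintain and extend our routing software.",
    "Your profile: Degree in computer science, Python experience.",
    "We offer: Flexible work hours and a yearly training budget.",
]
segments = {p: classify_paragraph(p) for p in ad}
```

Real systems typically replace the keyword lists with a trained classifier, but even a heuristic zoner like this can restrict downstream rule-based extraction to the relevant segment.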

Job ads contain information on topics such as the company, the job, or required qualifications. For an accurate extraction of skills and tasks, we need to identify the corresponding text zones, as many key terms are ambiguous, for instance ‘dynamic’ might refer to a personality trait or to a dynamic CRM system.

Gnehm, Ann-Sophie, and Simon Clematide. 2020. "Text Zoning and Classification for Job Advertisements in German, French and English." Pp. 83–93 in Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science. Association for Computational Linguistics.

Another example is using segmentation to differentiate the meaning of words depending on where in the ad they appear.

"Job ads serve both a selection and a marketing function. Analyzing job ads thus requires the segmentation of job ad texts along these functions, with employer description and offer/benefit referring to the marketing function and job description and candidate profile referring to the selection function. Research has shown that a word used in different text segments of job ads leads to different results in applicant attraction. For instance, whereas "flexible" as a candidate's trait appears more attractive for men than for women, "flexible" as an attribute of the employing organization or the work context (as, e.g., in "flexible work hours") is more attractive for women than for men."

Julia Brandl and Petra Eggenhofer-Rehart. More details in Eggenhofer-Rehart, P., Brandl, J., & Kohlberger, M. (2022). Flexibility and flexibility are not the same: A genre-sensitive method to measuring gendered wording in job advertisements. Presentation at the autumn workshop of the Wissenschaftliche Kommission Personalwesen, Verband der Hochschullehrer für Betriebswirtschaft e.V., Berlin, Germany, 29–30 September 2022.

As online job ads are usually very structured text documents, good results can be achieved: Gnehm and Clematide (2020) reached around 90 percent accuracy, Murauer et al. (2018) 78 percent accuracy, and Hermes and Schandock (2016) up to 97 percent accuracy (F-score of .95). A viability study by Stops et al. (2021) for extracting skill requirements achieved 98 percent accuracy (F-score of .98) on this task. It is important to note, however, that the performance of text segmentation algorithms can vary significantly with the overall data quality and data source. The study results above are therefore not directly comparable and need to be assessed in the context of the specific task and dataset they were trained on. Furthermore, generalizing these models to other datasets is challenging.
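For orientation, the F-scores reported above combine precision and recall into a single number. A minimal computation from confusion-matrix counts (the counts below are invented for illustration, not taken from the cited studies):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative: 95 correctly labelled segments, 5 false positives,
# 5 false negatives -> F1 = 0.95.
score = f1_score(tp=95, fp=5, fn=5)
```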

Identifying Duplicates

Duplicates are job postings that refer to the same vacancy. Identifying them is crucial for large-scale analysis of online job ads, because duplicates can skew or mislead the analysis. For example, if the same job appears multiple times in the dataset, demand for that particular type of job looks higher than it actually is. TextKernel estimates that, on average, a job ad is reposted two to five times, which means 50 to 80 percent of job postings are duplicates.

In order to remove duplicates from a dataset, we first have to identify them. This is challenging because the same job may be advertised on multiple job sites that differ in structure or content. Furthermore, postings are often altered slightly when published on different days. A different challenge is that similar job postings might occur weeks or months apart and represent different vacancies while being semantically very similar.
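A common baseline for near-duplicate detection is Jaccard similarity over word shingles. The sketch below is a minimal illustration; the 0.8 threshold is an assumption that would have to be tuned on labelled duplicate pairs:

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams, lower-cased; robust to small edits."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    # Threshold is an illustrative assumption; tune on labelled pairs.
    return jaccard(a, b) >= threshold
```

Pure text similarity is only a first signal; production systems additionally compare extracted fields such as employer, location, and posting dates.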

"Another tough test for models for duplicate detection would be job postings from a company with different job locations. Linguistically, the job postings could be identical except for the job location. This example makes it clear that a simple examination of text similarity is not sufficient, but features such as job location, employer, and deadline must be extracted from the text, as described in the next paragraph."

Stefan Winnige (BIBB)

Some results indicate that duplicate identification can achieve very good results with accuracy and F-scores of .95 and above (Zhao et al. 2021). However, results again depend heavily on which dataset is used and its data quality (see Data Collection). Furthermore, larger datasets make duplicate identification computationally expensive and increasingly hard to do. TextKernel summarizes their modeling efforts in the following way: "How accurate is the system [...]? Does it find all duplicates? Does it cluster together postings that are not duplicates? A short answer would be that the system is 'pretty good, 90%-ish'. A longer answer would require a discussion of many possible ways to evaluate such a system".

Normalizing Job Titles

Normalizing a job title means converting it to a standardized term or category from a taxonomy. For example, the job title "We look for a data science expert" should be linked to the standardized occupation "2511.4 - data scientist" in the ESCO classification.

This process is important because it helps to ensure that job titles are consistently represented and understood, regardless of the specific wording or phrasing used. Without normalization, it can be difficult to analyse online job ads in a structured way. Normalization is challenging for several reasons:

  • a job title can refer to multiple occupations; e.g. "Helfer/in im Bau" ("Assistant in construction" without specifying which area of construction)

  • a job title mentions multiple occupations; e.g. "Hochbauhelfer/in oder Tiefbauhelfer/in", "Service- und Küchenkraft" ("Assistant for building construction or underground construction"; "Service and kitchen staff")

  • a job title can be unspecific; e.g. "Projektleiter/in Elektro" (Project manager for electrical works)

  • the taxonomy itself is too detailed for ambiguous job titles; e.g. "Software Developer" and "Frontend Developer" might be two different occupations in a taxonomy, where a job title might not make a distinction.
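As a minimal sketch of the mapping step, raw titles can be matched against taxonomy labels with substring and fuzzy matching. The taxonomy below is a toy stand-in: "2511.4" is the ESCO code for data scientist mentioned above, while the other codes are invented placeholders:

```python
import difflib
from typing import Optional

# Toy taxonomy. "2511.4" is the ESCO data-scientist code; the other
# codes are invented placeholders for illustration.
TAXONOMY = {
    "data scientist": "2511.4",
    "software developer": "2512.x",
    "frontend developer": "2512.y",
}

def normalize_title(raw_title: str) -> Optional[str]:
    """Map a raw job title string to a taxonomy code, or None."""
    cleaned = raw_title.lower()
    # Direct substring match catches titles embedded in longer phrases.
    for label, code in TAXONOMY.items():
        if label in cleaned:
            return code
    # Fall back to fuzzy matching on the whole string (handles typos).
    match = difflib.get_close_matches(cleaned, TAXONOMY, n=1, cutoff=0.6)
    return TAXONOMY[match[0]] if match else None
```

Real normalization pipelines extend this with synonym lists, lexicalizations, and learned models, but the two-stage exact-then-fuzzy lookup conveys the basic idea.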

Job title normalisation models are exceedingly hard to evaluate using traditional metrics. This has to do with the reasons mentioned above, but also with some other challenges: Gold standard datasets that are created synthetically might have issues when transferring them to out-of-sample distributions. Gold standards created from online job postings need heavy preprocessing to ensure a certain class balance (some occupations are just very rare). Even then, taxonomies need to be updated regularly and so do extraction models to ensure that job titles can actually be normalized.

Developing an extraction model for a certain sector or a limited number of occupations might be easier to achieve, for three reasons: 1) it is easier to create balanced gold standard datasets, 2) variance is limited (and data distributions don't incorporate the whole job market), 3) it is easier to incorporate domain knowledge for a limited number of occupations.

Which taxonomy, ontology, or classification scheme is used depends mostly on the context of the analysis and has vast implications for the design and downstream performance of the normalisation approach. Often, taxonomies are expanded into more complete ontologies that incorporate domain knowledge through synonyms, lexicalizations, and relations.

Job postings and taxonomies have different purposes. This conflict cannot be entirely resolved. However, you can be aware of it and let this awareness guide your interpretation of the results.

Claudia Plaimauer - OJV Forum 2021

Extracting Skills

After the job title, skills and competences are the most frequently analyzed properties of OJAs. It is important to note that there are many different definitions of what a skill actually constitutes; it is a highly context-specific construct.

On a technical level, the goal of skill extraction is to identify and classify word sequences in text that relate to skills as defined in the analysis.

Broadly, two approaches exist. The first is rule-based extraction, which matches terms from a predefined skill lexicon or taxonomy against the text. The second approach is statistical extraction of skills with trained models, where we can differentiate between two steps: detecting skill mentions in the text and linking them to entries in a taxonomy.
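A rule-based extractor can be sketched as a lexicon lookup over a (previously segmented) candidate-profile text. The skill terms and IDs below are invented for illustration:

```python
# Rule-based sketch: match surface forms from an illustrative skill
# lexicon against the candidate-profile segment of an ad.
# IDs are invented placeholders, not a real taxonomy.
SKILL_LEXICON = {
    "python": "S-001",
    "project management": "S-002",
    "sql": "S-003",
}

def extract_skills(segment: str) -> list:
    """Return taxonomy IDs of all lexicon skills mentioned in the text."""
    text = segment.lower()
    return [sid for term, sid in SKILL_LEXICON.items() if term in text]

found = extract_skills("Your profile: solid Python and SQL knowledge.")
```

Statistical approaches replace the lexicon lookup with a trained sequence-labelling model for the detection step, followed by a linking step to the taxonomy.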

Other Indicators

There are many more indicators that could be of interest for OJA analysis: Data on wages, remote work options, sustainability, company presentation, and many more.

To give just two examples: To analyze gender bias in OJAs, predominantly male-stereotyped words, such as "high-performing" or "career-oriented," can be extracted to enrich the analysis. Another use case is to scrutinize job titles in more detail to derive the rank or seniority of the job within a company. The resulting "Seniority Score" could then be used to suggest suitable job offers to applicants based on their current job title.
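The gender-bias use case can be sketched as a simple count of coded words per ad. The word lists below are tiny illustrative samples, not a validated lexicon:

```python
import re

# Tiny illustrative word lists; a real analysis would use a validated
# gender-coded lexicon.
MASCULINE_CODED = {"high-performing", "career-oriented", "competitive", "dominant"}
FEMININE_CODED = {"supportive", "collaborative", "interpersonal", "committed"}

def gender_word_counts(ad_text: str) -> dict:
    """Count masculine- and feminine-coded words in an ad's text."""
    tokens = re.findall(r"[a-z\-]+", ad_text.lower())
    return {
        "masculine": sum(t in MASCULINE_CODED for t in tokens),
        "feminine": sum(t in FEMININE_CODED for t in tokens),
    }
```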

These two use cases are described in the following presentation:

Another example of an additional indicator is the extraction of sustainability keywords in OJAs. Such analyses can help build a better understanding of the signalling in OJAs and of how companies present themselves.

The following video describes the approach by Johanna Binnewitt and Timo Schnepf.
