Taxonomies and Ontologies

Overview of taxonomies and ontologies and their application in OJA analysis.

Taxonomies and ontologies are useful tools for online job ad analysis as they structure and systematize occupations, skills, and other concepts. In the life cycle of an OJA, they provide the crucial link between OJAs and the domain at large.

A taxonomy is a hierarchical system for organizing and categorizing concepts. An ontology additionally models typed relationships between those concepts (for example, that a skill is required by an occupation).

Taxonomies for Online Job Ad Analysis

For labor market analysis, various taxonomies have been developed for different concepts. The taxonomies below were developed by expert commissions and are updated regularly, making them a good starting point for any OJA-related project:

Occupations

The International Standard Classification of Occupations (ISCO) is an international classification developed and maintained by the International Labour Organization (ILO) for organizing jobs into a clearly defined set of groups according to the tasks and duties of the job.

The German Classification of Occupations (Klassifikation der Berufe 2010; KldB 2010) is the standard occupation taxonomy in Germany, developed to describe and systematize the German occupational landscape. The KldB 2010 is a hierarchical classification with five levels: occupations are classified along a horizontal dimension (occupational expertise) and a vertical dimension (requirement level). The Bundesagentur für Arbeit provides a provisional mapping of the KldB 2010 to ISCO.
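Because the digits of a KldB 2010 code encode the hierarchy directly (one additional digit per level, with the fifth digit giving the requirement level), the five levels can be derived programmatically. A minimal sketch in Python; the example code is illustrative, so verify the exact level semantics against the official KldB 2010 systematics:

```python
def kldb_levels(code: str) -> dict:
    """Split a 5-digit KldB 2010 code into its hierarchical levels.

    The first four digits successively refine the occupational
    expertise (horizontal dimension); the fifth digit encodes the
    requirement level (vertical dimension, 1 = Helfer ... 4 = Experte).
    """
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a 5-digit KldB 2010 code: {code!r}")
    return {
        "berufsbereich": code[:1],       # level 1
        "berufshauptgruppe": code[:2],   # level 2
        "berufsgruppe": code[:3],        # level 3
        "berufsuntergruppe": code[:4],   # level 4
        "berufsgattung": code,           # level 5
        "anforderungsniveau": int(code[4]),  # requirement level
    }

# e.g. a code from occupational group 434 (software development)
print(kldb_levels("43414"))
```

This kind of helper makes it easy to aggregate OJA counts at any of the five levels without a separate lookup table.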


The O*NET-SOC database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the U.S. labor market. Furthermore, it links knowledge, skills, and abilities to standardized occupations.

Skills / Competences

ESCO (also occupations)

ESCO is the multilingual classification of European Skills, Competences, Qualifications, and Occupations. It identifies and categorizes the concepts relevant for the EU labor market and for education and training in 25 European languages, and it structures occupations using hierarchical relationships, metadata, and mappings to the International Standard Classification of Occupations (ISCO).

Berufenet is a service of the Bundesagentur für Arbeit that provides comprehensive information about professions in Germany, including a competence classification scheme.

The Occupational Information Network (O*NET) is developed under the sponsorship of the U.S. Department of Labor/Employment and Training Administration. Apart from the SOC occupations, it also provides a comprehensive taxonomy of skills, abilities, and competences.

Sectors / Industries

The Statistical Classification of Economic Activities in the European Community (NACE) is a system for classifying economic sectors. The German classification WZ 2008 (Klassifikation der Wirtschaftszweige) is built upon NACE. An economic sector or industry is usually defined as a group of companies or businesses that are similar in terms of the economic activity performed, the manufacturing process, or the products manufactured.


Regions

The Nomenclature of Territorial Units for Statistics (NUTS) is a geographical system according to which the territory of the European Union is divided into hierarchical levels. The three hierarchical levels are known as NUTS-1, NUTS-2, and NUTS-3. This classification enables cross-border statistical comparisons at various regional levels within the EU.
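NUTS codes are hierarchical by construction: a two-letter country code followed by one additional character per level, so the parent regions of a NUTS-3 code can be derived directly from the string. A minimal Python sketch, assuming well-formed codes:

```python
def nuts_levels(code: str) -> dict:
    """Derive the NUTS hierarchy from a regional code.

    A NUTS code starts with a two-letter country code; each further
    character adds one hierarchical level (NUTS-1 up to NUTS-3).
    """
    code = code.strip().upper()
    if not (2 <= len(code) <= 5) or not code[:2].isalpha():
        raise ValueError(f"not a valid NUTS code: {code!r}")
    return {
        "country": code[:2],
        "nuts1": code[:3] if len(code) >= 3 else None,
        "nuts2": code[:4] if len(code) >= 4 else None,
        "nuts3": code[:5] if len(code) == 5 else None,
    }

# a NUTS-3 code carries its NUTS-1 and NUTS-2 parents as prefixes
print(nuts_levels("DE212"))
```

As with occupation codes, this allows OJA counts to be rolled up from fine-grained regions to coarser levels without an external mapping.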

Labor market regions offer the advantage of reflecting the spatial aspects of economic activities as accurately as possible, thus providing relevant analytical units for regional research.

Education / Certificates

The International Standard Classification of Education (ISCED) is the reference international classification for organizing educational programs and related qualifications by levels and fields.

Applicant Attraction

A classification of employer attributes, tasks, candidate requirements, and offered benefits, based on applicant preferences with regard to gender, age (groups), and (as of January 2024) cultural background. The classification was developed at the University of Innsbruck (Austria) on the basis of representative surveys of German-speaking labor market participants (potential job applicants). In its current form, the Job Ad Decoder Dictionary (JADE Dictionary) contains 468 job-ad-related words, most of which were found to appear more attractive to men than to women or vice versa, more attractive to younger applicants than to older applicants or vice versa, or to be relevant in both diversity dimensions. The dictionary is used directly to decode bias in job ads in the online tool Job Ad Decoder.

There are two main advantages of using standardized taxonomies. Firstly, they ensure that models built on them, and analyses derived from those models, are interoperable and relevant to stakeholders at different levels. Using the KldB 2010 taxonomy, for example, ensures that you can cross-reference your findings with official analyses by the Federal Statistical Office or the Bundesagentur für Arbeit. Secondly, the taxonomies were developed by domain experts and are therefore relevant to the labor market.

The main challenge of using these pre-defined taxonomies is that they were developed for many use cases and might not directly fit the problem you want to model.

One way of mitigating this challenge is to develop a custom taxonomy for your problem while ensuring that its concepts can be linked back to the standardized taxonomies.
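Such a link can be as simple as an explicit crosswalk table from custom categories to standard codes. The following Python sketch uses hypothetical project categories; the KldB and ISCO codes shown are illustrative and should be verified against the official classifications:

```python
# Hypothetical crosswalk from project-specific categories to standard
# classifications (codes are illustrative, not an official mapping).
CROSSWALK = {
    "software_jobs": {
        "kldb_2010": ["434"],   # software development and programming
        "isco_08": ["2512"],    # software developers
    },
    "sales_jobs": {
        "kldb_2010": ["621"],   # retail sales occupations
        "isco_08": ["5223"],    # shop sales assistants
    },
}

def to_standard(category: str, scheme: str) -> list:
    """Translate a custom category into codes of a standard scheme."""
    return CROSSWALK.get(category, {}).get(scheme, [])

print(to_standard("software_jobs", "isco_08"))
```

Keeping the crosswalk as explicit data (rather than hard-coded logic) makes it easy to review with domain experts and to extend when the custom taxonomy grows.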

Developing a Taxonomy

Developing a taxonomy for a data project involves organizing and categorizing data in a logical and meaningful way. There are several approaches you can take; Uschold and Gruninger (1996) define three general ones:

  1. Bottom-up: This approach starts with the most granular or specific data points and then groups them into larger, more general categories. For example, you can start by identifying individual job titles, then group those titles into broader job functions (such as "engineering" or "sales"), and then group those functions into even broader industry categories. For this approach, unsupervised classification (e.g., clustering) can be used to develop a data-driven taxonomy.

  2. Top-down: This approach involves starting with the most general categories and then breaking them down into smaller, more specific subcategories. For example, you can start with broad industry categories and then drill down into specific job functions and individual job titles.

  3. Middle-out: This approach involves starting with a general set of categories and then adding additional layers of specificity as needed. This approach can be helpful when you have a large amount of data and you're not sure where to start.
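The bottom-up approach can be sketched with a toy example: individual job titles are grouped into broader functions via shared keywords. All titles, function names, and keyword rules below are invented for illustration; a real project would use clustering or embedding-based similarity rather than exact keyword matches:

```python
from collections import defaultdict

# Illustrative keyword rules mapping title tokens to broader functions.
FUNCTION_KEYWORDS = {
    "engineering": {"engineer", "developer", "architect"},
    "sales": {"sales", "account"},
}

def group_titles(titles):
    """Bottom-up step: assign each job title to the first function
    whose keywords overlap with the title's tokens."""
    groups = defaultdict(list)
    for title in titles:
        tokens = set(title.lower().split())
        for function, keywords in FUNCTION_KEYWORDS.items():
            if tokens & keywords:
                groups[function].append(title)
                break
        else:
            groups["unassigned"].append(title)
    return dict(groups)

print(group_titles([
    "Backend Developer", "Sales Manager",
    "Cloud Architect", "Account Executive", "HR Generalist",
]))
```

The "unassigned" bucket is deliberate: in a bottom-up process, leftovers like these are exactly the data points that suggest new categories for the next iteration.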

Regardless of which approach you choose, it's important to ensure that your taxonomy is logical, intuitive, and most importantly fits your use case.

[...] some fundamental rules in ontology design [...]. These rules may seem rather dogmatic. They can help, however, to make design decisions in many cases.

1) There is no one correct way to model a domain— there are always viable alternatives. The best solution almost always depends on the application that you have in mind and the extensions that you anticipate. 2) Ontology development is necessarily an iterative process. 3) Concepts in the ontology should be close to objects (physical or logical) and relationships in your domain of interest. These are most likely to be nouns (objects) or verbs (relationships) in sentences that describe your domain.

Ontology Development 101: A Guide to Creating Your First Ontology

Data Standards for Taxonomies and Ontologies

Semantic Modeling is a difficult endeavor with many semantic and logical pitfalls and trade-offs. Kai Krüger (BIBB) gives the following example:

When it comes to semantic modelling different approaches yield different results. This must be taken into account when interpreting the results. Perhaps it becomes clearest with an (exaggerated) example:

Suppose a job description contains a sentence like:

"[...] Tasks: Occasional support in research activities on new procedures in the field of Machine Learning"

Machine Learning (as a technical skill) could now be interpreted as part of the person's tasks, or the emphasis could be placed on the research activity (which is, moreover, only supporting and only occasional). Both interpretations have advantages and disadvantages.

One could discuss a number of points here that fall under the heading of semantic modeling. However, if you only publish "our model with an F1 score of 97% concludes that tasks in the field of AI have increased by 10%", then all these points are lost. The structure and selection of a taxonomy, or the development of annotation guidelines, are therefore not only tasks that are important for achieving the result: reflecting on this process and making it transparent are essential prerequisites for the epistemological value, as well as the connectability and comparability, of the study.

Kai Krüger (BIBB)

For the evaluation of an ontology, especially in the context of an information extraction task, one must consider both the ontology itself and the results of the extraction. For this, Panos Alexopoulos (2020) formulates the following dimensions in his book on semantic modeling.

  • Semantic Precision (= “the degree to which the semantic assertions of a model are accepted to be true”): How precise is the extraction of entities given the ontology and extraction rules? The precision of the extraction is influenced by various aspects: errors within the taxonomy (e.g., entities are not clearly described in the taxonomy), errors in the gold standard (e.g., employers provide the wrong entity for a job posting), lack of expert knowledge from the involved parties (e.g., keywords are incorrectly assigned), vagueness and ambiguity in the taxonomy (e.g., the assignment of a job title is not clear and different experts would make different decisions).

  • Completeness (= “the degree to which elements that should be contained in the model are indeed there”): How many of the actual entities can be found? The completeness of the extraction is primarily influenced by the completeness of the taxonomy and ontology. The more search terms used for extraction, the higher the completeness. Completeness is calculated via the recall of the extraction.

  • Consistency (= “a semantic model is free of logical or semantic contradictions”): The consistency of the ontology is determined by how consistent the enrichment process is.

  • Ambiguity (~conciseness = “the degree to which the model does not contain redundant elements”): The succinctness, and conversely the ambiguity, of the ontological extraction can be measured at two levels. At the ontology level, it can be determined by the number of search terms that are assigned to multiple entities (Ontological Ambiguity). At the extraction level (Extraction Ambiguity), one can measure how many search terms identify 1) only True Positives (no ambiguity), 2) both True Positives and False Positives, and 3) multiple False Positives.

  • Timeliness (= “the degree to which the model contains elements that reflect the current version of the world”): The currency of the taxonomy is determined by the currency of the entities and all additional attributes, relations, and search words. Professions and competencies and their relations change over time. This primarily results in challenges regarding the maintenance of the models.

  • Relevancy (= “model is relevant when its structure and content are useful and important for a given task or application“): The relevance of the model largely depends on the local study context and specific application.

  • Understandability (= “the ease with which human consumers can understand and utilize the model’s elements, without misunderstanding or doubting their meaning“): The understandability of the extraction particularly refers to how comprehensible the extraction is and whether the results are plausible.

  • Trustworthiness (= "the perception and confidence in the quality of the model by its users")

These dimensions often conflict with each other. Completeness and precision, for example, must be considered together: the more unspecific search terms one includes in the model, the higher the likelihood of recognizing entities, but precision decreases at the same time. There is a similar trade-off between completeness and low ambiguity: unspecific search terms can describe several entities, so minimizing their number also decreases the completeness of the extraction.
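The completeness-precision trade-off can be made concrete by computing the standard extraction metrics for two hypothetical configurations of the same model (all counts below are invented for illustration):

```python
def extraction_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision (~semantic accuracy), recall (~completeness) and F1
    for an extraction run, from counts of true positives, false
    positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# A model with only specific search terms vs. the same model after
# adding many unspecific terms: recall rises, precision drops.
strict = extraction_metrics(tp=80, fp=5, fn=40)
loose = extraction_metrics(tp=110, fp=40, fn=10)
print(strict)
print(loose)
```

Reporting both configurations side by side, instead of a single F1 score, is one way to make the modeling trade-offs discussed above transparent.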
