Taxonomies and Ontologies

Overview of taxonomies and ontologies and their application in OJA analysis.

Taxonomies and ontologies are useful tools for online job ad (OJA) analysis because they structure and systematize occupations, skills, and other concepts. In the life cycle of an OJA, they provide the crucial link between individual OJAs and the domain at large.

A taxonomy is a hierarchical system for organizing and categorizing concepts.

Taxonomies for Online Job Ad Analysis

For labor market analysis, various taxonomies have been developed for different concepts. Expert commissions developed the taxonomies below and update them regularly, which makes them a good starting point for any OJA-related project:

There are two main advantages of using standardized taxonomies. Firstly, it ensures that models built on them, and the analyses derived from those models, are interoperable and relevant to stakeholders at different levels. Using the KldB-2010 taxonomy, for example, ensures that you can cross-reference your findings with official analyses by the Federal Statistical Office or the Bundesagentur für Arbeit. Secondly, the taxonomies were developed by experts and are relevant to the labor market.

The main challenge of using these pre-defined taxonomies is that they were developed for many use cases and might not fit the problem you want to model directly.

One way of mitigating this challenge is to develop a custom taxonomy for your problem while ensuring that you can link its concepts to those of the standardized taxonomies.
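As a minimal sketch of such a link, each custom concept can carry the codes of the standard concepts it maps to. The class name, the concept labels, and the codes (`STD-001`, `STD-002`) are hypothetical placeholders and are not real KldB-2010 codes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CustomConcept:
    """A concept in a project-specific taxonomy (hypothetical structure)."""
    label: str
    standard_codes: Tuple[str, ...]  # codes of linked standard-taxonomy concepts


# Hypothetical custom concepts; the codes are placeholders, not real KldB-2010 codes.
CUSTOM_TAXONOMY = [
    CustomConcept("Data Scientist", ("STD-001",)),
    CustomConcept("ML Engineer", ("STD-001", "STD-002")),
    CustomConcept("Prompt Designer", ()),  # not linked yet
]


def unlinked_concepts(taxonomy: List[CustomConcept]) -> List[str]:
    """Return labels of custom concepts that have no link to the standard taxonomy."""
    return [c.label for c in taxonomy if not c.standard_codes]


def concepts_by_standard_code(taxonomy: List[CustomConcept]) -> Dict[str, List[str]]:
    """Invert the mapping: standard code -> labels of linked custom concepts."""
    index: Dict[str, List[str]] = {}
    for concept in taxonomy:
        for code in concept.standard_codes:
            index.setdefault(code, []).append(concept.label)
    return index
```

Checking `unlinked_concepts` regularly is one way to notice custom concepts whose results could not be cross-referenced with analyses based on the standard taxonomy.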

Developing a Taxonomy

Developing a taxonomy for a data project involves organizing and categorizing data in a logical and meaningful way. There are several approaches you can take when developing a taxonomy. Three general ways were defined by Uschold and Gruninger (1996):

  1. Top-down: This approach involves starting with the most general categories and then breaking them down into smaller, more specific subcategories. For example, you can start with broad industry categories and then drill down into specific job functions and individual job titles.

  2. Bottom-up: This approach involves starting with the most specific concepts and then grouping them into more general categories. For example, you can start with the individual job titles found in your OJAs and group them into job functions.

  3. Middle-out: This approach involves starting with the most salient mid-level concepts and then generalizing upward and specializing downward as needed. This approach can be helpful when you have a large amount of data and you're not sure where to start.

Regardless of which approach you choose, it's important to ensure that your taxonomy is logical, intuitive, and most importantly fits your use case.
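The top-down example (industry, then job function, then job title) could be sketched as a nested structure. The categories, titles, and function names below are made up for illustration and do not come from any official taxonomy.

```python
# Hypothetical top-down taxonomy: industry -> job function -> job titles.
TAXONOMY = {
    "Information Technology": {
        "Software Development": ["Backend Developer", "Frontend Developer"],
        "Data & Analytics": ["Data Analyst", "Data Scientist"],
    },
    "Healthcare": {
        "Nursing": ["Registered Nurse"],
    },
}


def iter_paths(tree, prefix=()):
    """Yield the full path from the root to every leaf of the taxonomy."""
    if isinstance(tree, dict):
        # Inner node: descend into each subcategory.
        for name, subtree in tree.items():
            yield from iter_paths(subtree, prefix + (name,))
    else:
        # A list of leaves (job titles).
        for leaf in tree:
            yield prefix + (leaf,)


def find_path(tree, title):
    """Return the path to a job title, or None if the title is not in the taxonomy."""
    for path in iter_paths(tree):
        if path[-1] == title:
            return path
    return None
```

With this structure, `find_path(TAXONOMY, "Data Scientist")` returns the tuple `("Information Technology", "Data & Analytics", "Data Scientist")`, which makes every level of the hierarchy available for aggregation.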

[...] some fundamental rules in ontology design [...]. These rules may seem rather dogmatic. They can help, however, to make design decisions in many cases.

1) There is no one correct way to model a domain— there are always viable alternatives. The best solution almost always depends on the application that you have in mind and the extensions that you anticipate. 2) Ontology development is necessarily an iterative process. 3) Concepts in the ontology should be close to objects (physical or logical) and relationships in your domain of interest. These are most likely to be nouns (objects) or verbs (relationships) in sentences that describe your domain.

Ontology Development 101: A Guide to Creating Your First Ontology

Data Standards for Taxonomies and Ontologies

Semantic modeling is a difficult endeavor with many semantic and logical pitfalls and trade-offs. Kai Krüger (BIBB) gives the following example:

When it comes to semantic modelling different approaches yield different results. This must be taken into account when interpreting the results. Perhaps it becomes clearest with an (exaggerated) example:

In a job description, there is a sentence like:

"[...] Tasks: Occasional support in research activities on new procedures in the field of Machine Learning"

Now, Machine Learning (as a technical skill) could be interpreted as part of the person's tasks, or the emphasis could be placed on the professional research (and even then it is only supporting and only partially). Both interpretations have advantages and disadvantages.

One could discuss a number of points here that fall under the heading of semantic modeling. However, if you only publish "our model with an F1 score of 97% concludes that tasks in the field of AI have increased by 10%", then all these points are lost. Therefore, things like the structure or selection of a taxonomy or the development of annotation guidelines are not only tasks that are important to achieve the result, but both reflecting on this process and making it transparent are essential prerequisites for the epistemological value, as well as the connectability and comparability of the study.

Kai Krüger (BIBB)

For the evaluation of an ontology, especially in the context of an information extraction task, one must consider both the ontology itself and the results of the extraction. For this, Panos Alexopoulos (2020) formulates the following dimensions in his book on semantic modelling:

  • Semantic Precision (= “the degree to which the semantic assertions of a model are accepted to be true”): How precise is the extraction of entities given the ontology and extraction rules? The precision of the extraction is influenced by various aspects: errors within the taxonomy (e.g., entities are not clearly described in the taxonomy), errors in the gold standard (e.g., employers provide the wrong entity for a job posting), lack of expert knowledge from the involved parties (e.g., keywords are incorrectly assigned), vagueness and ambiguity in the taxonomy (e.g., the assignment of a job title is not clear and different experts would make different decisions).

  • Completeness (= “the degree to which elements that should be contained in the model are indeed there”): How many of the actual entities can be found? The completeness of the extraction is primarily influenced by the completeness of the taxonomy and ontology. The more search terms used for extraction, the higher the completeness. Completeness is calculated via the recall of the extraction.

  • Consistency (= “a semantic model is free of logical or semantic contradictions”): The consistency of the ontology is determined by how consistent the enrichment process is.

  • Ambiguity (~conciseness = “the degree to which the model does not contain redundant elements”): The succinctness and vice versa the ambiguity of the ontological extraction can be measured at two levels: At the ontology level, it can be determined by the number of search terms that are assigned to multiple entities (Ontological Ambiguity). At the extraction level (Extraction Ambiguity), it can be measured how many search terms each identify 1) only True Positives (no ambiguity), 2) both True Positives and False Positives, and 3) multiple False Positives.

  • Timeliness (= “the degree to which the model contains elements that reflect the current version of the world”): The currency of the taxonomy is determined by the currency of the entities and all additional attributes, relations, and search words. Professions and competencies and their relations change over time. This primarily results in challenges regarding the maintenance of the models.

  • Relevancy (= “model is relevant when its structure and content are useful and important for a given task or application”): The relevance of the model largely depends on the local study context and specific application.

  • Understandability (= “the ease with which human consumers can understand and utilize the model’s elements, without misunderstanding or doubting their meaning”): The understandability of the extraction particularly refers to how comprehensible the extraction is and whether the results are plausible.

  • Trustworthiness (= “the perception and confidence in the quality of the model by its users”)
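The two ambiguity measures from the list above can be sketched in a few lines. The toy ontology, the search terms, and the function names are hypothetical. The third label ("only false positives") is one possible reading of the "multiple False Positives" category and is an assumption.

```python
from collections import defaultdict

# Hypothetical ontology: entity -> search terms used to find it.
ONTOLOGY = {
    "Machine Learning": {"machine learning", "ml", "learning"},
    "Markup Languages": {"html", "xml", "ml"},
    "Training and Education": {"training", "learning"},
}


def ontological_ambiguity(ontology):
    """Return the search terms assigned to more than one entity."""
    entities_by_term = defaultdict(set)
    for entity, terms in ontology.items():
        for term in terms:
            entities_by_term[term].add(entity)
    return {
        term: sorted(entities)
        for term, entities in entities_by_term.items()
        if len(entities) > 1
    }


def extraction_ambiguity(matches):
    """Classify each search term by the correctness of its matches.

    `matches` is an iterable of (term, is_true_positive) pairs, for example
    the result of comparing extraction output against a gold standard.
    """
    outcomes = defaultdict(set)
    for term, is_true_positive in matches:
        outcomes[term].add(is_true_positive)
    labels = {}
    for term, seen in outcomes.items():
        if seen == {True}:
            labels[term] = "only true positives"
        elif seen == {False}:
            labels[term] = "only false positives"  # assumed reading of category 3
        else:
            labels[term] = "mixed"
    return labels
```

In this toy ontology, `ml` and `learning` are each assigned to two entities, so the ontological ambiguity count is 2.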

These dimensions often conflict with each other. Completeness and precision, for example, must be considered together: the more unspecific search terms one includes in the model, the higher the likelihood of recognising entities, but precision decreases at the same time. There is a similar trade-off between completeness and low ambiguity: unspecific search terms can, in principle, describe several entities. Minimizing the number of unspecific search terms reduces ambiguity, but it also decreases the completeness of the extraction.
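The trade-off between completeness (recall) and precision can be illustrated with a toy extraction. The ads, the gold standard, the search terms, and the naive substring matching are all assumptions made for this sketch and do not represent a recommended extraction method.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of ad ids."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def extract(ads, terms):
    """Return ids of ads whose text contains any search term (naive substring match)."""
    return {
        ad_id
        for ad_id, text in ads.items()
        if any(term in text.lower() for term in terms)
    }


# Hypothetical job ads and gold standard for the skill "Machine Learning".
ADS = {
    1: "Research on machine learning procedures",
    2: "Build ML pipelines for production",
    3: "Learning and development coordinator",
    4: "Teach HTML and XML basics",
}
GOLD = {1, 2}

SPECIFIC_TERMS = {"machine learning"}
BROAD_TERMS = {"machine learning", "ml", "learning"}

specific_scores = precision_recall(extract(ADS, SPECIFIC_TERMS), GOLD)
broad_scores = precision_recall(extract(ADS, BROAD_TERMS), GOLD)
```

With the specific term only, ad 1 is found (precision 1.0, recall 0.5). Adding the unspecific terms `ml` and `learning` finds all four ads, which raises recall to 1.0 but lowers precision to 0.5, because `ml` also matches inside "HTML" and "XML".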
