Extraction Methods

Overview of extraction methods commonly used in online job advertisement (OJA) enrichment processes.

Pre-Processing and Embeddings

Embeddings

Text embedding is a technique for representing text in a numerical format that can serve as input to a machine learning model or be used for analysis. Most natural language processing methods rely on embeddings, as they transform natural language text into a form that statistical methods can work with. Embedding is often a pre-processing step, but word embeddings are also used directly, for example in semantic similarity or semantic modeling tasks such as finding similar documents or synonyms in context.

Some standard text embedding techniques are:

  1. Count vectorization: This technique represents a text as a numerical vector based on the occurrences or frequency of words within the text.

  2. tf-idf (term frequency-inverse document frequency): This method represents a text as a numerical vector based on the frequency of words within the text, weighted by how rare (and therefore informative) those words are across the overall corpus.

  3. Word2Vec: This method represents words as numerical vectors based on the context in which they appear. Similar approaches are Tok2Vec, Doc2Vec, or GloVe.

  4. Transformer models: These are a newer class of text-embedding techniques based on deep learning that can capture more complex, contextualized relationships between words. Examples of models built on the transformer architecture are BERT and GPT. While you can train transformer models from scratch, it is often easier to take a foundation model and fine-tune it on your data and use case. The only publicly available domain-adapted transformer-based language model for German-language job advertisements was published by Gnehm et al. (2022). A short code sketch contrasting count-based and transformer-based embeddings follows this list.

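To make the contrast concrete, here is a minimal sketch of the first and last technique in the list, assuming the scikit-learn and sentence-transformers packages are installed; the snippets and the multilingual model name are purely illustrative.

```python
# Minimal sketch: two ways to embed short job-ad snippets.
# Assumes scikit-learn and sentence-transformers are installed; the model
# name below is just an illustrative multilingual example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

snippets = [
    "We are looking for a data analyst with SQL experience.",
    "Gesucht: Data Analyst (m/w/d) mit SQL-Kenntnissen.",
    "Join our marketing team as a content writer.",
]

# 1) Sparse count-based embedding (tf-idf): one column per vocabulary term.
tfidf = TfidfVectorizer()
sparse_vectors = tfidf.fit_transform(snippets)           # shape: (3, vocab_size)

# 2) Dense contextual embedding from a pretrained transformer.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
dense_vectors = model.encode(snippets)                    # shape: (3, 384)

print(sparse_vectors.shape, dense_vectors.shape)
```

The tf-idf vectors are sparse and vocabulary-sized, while the transformer embeddings are dense, fixed-length, and sensitive to context.
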
While transformer models are the state-of-the-art method for embedding text, traditional methods might work better for some tasks. This can be the case if the modeling goal is straightforward - e.g., if the words themselves are the relevant entities rather than their context. If unsure about the complexity of a task, it is usually a good idea to use traditional methods as a baseline for comparison.

A critical downside of transformer embeddings is that they are computationally expensive and usually require a machine with a GPU to run efficiently.

Tokenization, PoS, Co-Reference Resolution

Apart from text embeddings, you can usually employ other steps before putting the data into a rule-based or statistical model. Each of the following techniques represents its own class of models and may be required as a preprocessing step (e.g., for computing embeddings).

Tokenization divides the text into smaller units called tokens. These tokens can be words, phrases, sentences, or other word sequences. Tokenization is often the first step in natural language processing tasks, as it helps to break up the text into manageable pieces that can be more easily analyzed. Splitting texts into sentences is often helpful for the analysis of OJA as some sections of text are highly structured, e.g., bullet points or lists.

"Especially, the bullet-pointed lists can make it challenging for standard tokenizers to split job ad text into sentences. Another characteristic of job ads are lists that begin with a half-sentence, for example: "You are: - communicative, - team-oriented...". In reality, the job ad contains two sentences, "You are communicative" and "You are team-oriented", which standard tokenizers do not account for."

Stefan Winnige (BIBB)
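
A minimal spaCy sketch of one way to approach this, assuming the en_core_web_sm model is installed: treat bullet characters as additional sentence boundaries. The rule is deliberately simplistic and does not reconstruct half-sentences of the kind described in the quote above.

```python
# Sketch: sentence splitting where bullet characters also start a new sentence.
# Assumes spaCy and the small English model (en_core_web_sm) are installed.
# Note: this only splits on bullets; it does not rebuild half-sentences
# ("You are: - communicative") into full sentences as described above.
import spacy
from spacy.language import Language

BULLETS = {"-", "*", "•"}

@Language.component("bullet_sentence_boundaries")
def bullet_sentence_boundaries(doc):
    for token in doc[:-1]:
        if token.text in BULLETS:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("bullet_sentence_boundaries", before="parser")

doc = nlp("Your profile: - strong SQL skills - experience with Python - fluent English")
for sent in doc.sents:
    print(sent.text)
```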

Part-of-speech tagging identifies the part of speech (e.g., noun, verb, adjective) of each word in a text. It can be useful for various natural language processing tasks, such as syntax parsing and text classification, and is also very useful for rule-based matching approaches, e.g., when identifying skills.
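
As a rough illustration of how part-of-speech tags can feed a rule-based matcher, the following sketch (again assuming spaCy with en_core_web_sm) surfaces adjective-plus-noun phrases as crude skill-phrase candidates; the pattern is a hypothetical example, not a recommended rule.

```python
# Sketch: part-of-speech tagging and a simple POS-based pattern with spaCy.
# Assumes the small English model (en_core_web_sm); the ADJ+NOUN pattern is
# only an illustrative heuristic for surfacing rough skill-phrase candidates.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("We expect excellent communication skills and solid statistical knowledge.")

# Inspect the tags themselves.
for token in doc:
    print(token.text, token.pos_)

# Use the tags in a simple pattern: an adjective followed by one or more nouns.
matcher = Matcher(nlp.vocab)
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]])
for _, start, end in matcher(doc):
    print("candidate:", doc[start:end].text)
```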

Co-reference resolution is a method for identifying and disambiguating mentions of the same entity within a text. For example, if a text mentions "the job seeker should..." and "their skills..." in reference to the same person, co-reference resolution would identify these as references to the same entity. Co-reference resolution can be useful for tasks where concepts must be attributed to different entities. For example: Does the adjective "innovative" refer to the employer, or is it an attribute of the job candidate they are looking for?
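
Co-reference resolution is usually delegated to a dedicated model or library. The sketch below assumes the fastcoref package and its FCoref interface; the exact call signature and output format may differ between versions, so treat it as an outline rather than a reference.

```python
# Sketch of co-reference resolution, assuming the fastcoref package is
# installed (pip install fastcoref); the interface and output shown here
# may differ between versions.
from fastcoref import FCoref

model = FCoref()  # loads a default English coreference model
preds = model.predict(
    texts=["The job seeker should be communicative. Their skills include teamwork."]
)

# Each cluster groups mentions that refer to the same entity,
# e.g. [["The job seeker", "Their"]]
print(preds[0].get_clusters())
```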

Machine Learning Tasks

For the modeling and extraction of the concepts in the chapter Data Enrichment, many different approaches can be used. Generally, there is no single gold-standard approach to every problem; many concepts require a mix of approaches and models.

The following model architectures are not mutually exclusive - Semantic Similarity can be used for classification, or rule-based matching can be combined with a statistical NER model. Nevertheless, they give a short overview of generalized modeling approaches for problems often encountered when working with OJA data.

Document Classification

This model architecture assigns a text document to one or more predefined categories or labels. The model is usually trained on a labeled gold-standard dataset. Classification models can be used for many tasks in OJA analysis, including job title and skill classification. They can also be combined with other approaches for disambiguation or entity linking.

One challenge you often encounter in supervised classification is the number of labels or concepts. Multi-class classification models are often designed for a few classes (<10) but not thousands of classes (as with an occupation taxonomy, for example).
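
A minimal sketch of supervised document classification, assuming scikit-learn; the snippets, labels, and the two-category scheme are invented for illustration, whereas real pipelines are trained on an annotated gold-standard dataset with a proper taxonomy.

```python
# Sketch: a simple supervised document classifier on toy job-ad snippets.
# Assumes scikit-learn; the texts, labels, and categories are made up for
# illustration - real pipelines train on a labeled gold-standard dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Looking for a nurse to join our hospital ward team.",
    "Registered nurse needed for night shifts.",
    "We hire a software developer with Python experience.",
    "Backend developer wanted, experience with APIs required.",
]
labels = ["health", "health", "it", "it"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["Java developer for our cloud platform"]))  # -> ['it']
```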

Token Classification

Named entity recognition (NER) is a popular NLP task that identifies and classifies named entities - traditionally people, organizations, and locations - in a text. NER models are a specific implementation of a broader class of models: token classification models. The main difference from document classification above is that the classification happens not at the document level but at the token level. Tokens can be words, phrases, arbitrary word sequences, or entire sentences. These models are typically used to identify company names in OJAs, generate "skill candidates", or find other concept mentions in a text.

Large taxonomies are a challenge for token classification models as well. To mitigate this, you can use a token-level model to identify skill mentions in a text and then use an entity linking or text similarity approach to find the correct concept.
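
As an illustration, the following sketch runs a pretrained, general-purpose NER model (spaCy's en_core_web_sm, assumed to be installed) over a toy job-ad sentence; a production pipeline would typically be trained or fine-tuned on OJA-specific labels such as SKILL.

```python
# Sketch: using a pretrained token-level (NER) model to surface entity
# candidates in a job ad. Assumes spaCy's en_core_web_sm; predictions on
# this toy sentence will vary with the model used.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme GmbH in Berlin is hiring a data engineer with Spark experience.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # typically something like "Acme GmbH" ORG, "Berlin" GPE
```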

Unsupervised Classification

Unsupervised classification refers to various machine learning techniques to automatically categorize text into categories or labels without explicit training data. This is in contrast to supervised classification, which involves training a model on a labeled dataset.

There are many different architectures and modeling goals in unsupervised learning, such as clustering, dimensionality reduction, or topic modeling. Unsupervised learning can also be used as a preprocessing technique for supervised classification.

Unsupervised classification can be useful for online job ad analysis because it allows text to be grouped automatically into categories based on its content, without manually labeled data or a formal taxonomy. In fact, it can serve as a starting point for developing a taxonomy.
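
A minimal clustering sketch, assuming scikit-learn; the snippets and the choice of two clusters are illustrative, and in practice the number of clusters has to be tuned or derived from the data.

```python
# Sketch: unsupervised grouping of job-ad snippets via k-means on tf-idf
# vectors. Assumes scikit-learn; the number of clusters (2) is an arbitrary
# illustrative choice.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Nurse wanted for our intensive care unit.",
    "We are hiring a pediatric nurse.",
    "Python developer for data pipelines.",
    "Frontend developer with React experience.",
]

vectors = TfidfVectorizer().fit_transform(texts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(clusters)  # e.g. [0 0 1 1] - group labels without any predefined taxonomy
```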

Rule-Based Matching

This type of model architecture uses a set of predefined rules or patterns to identify and extract specific pieces of information from a text. Rule-based matching models can be effective for information extraction and entity-linking tasks. It is an often-used method in online job ad analysis for several reasons: 1) compared to statistical models, the amount of labeled data required is much lower (usually only for evaluation purposes); 2) when a model is based on a formalized taxonomy, search words and additional information are often already available; and 3) using rules and search words gives you more control over the vocabulary than statistical approaches do.
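
A minimal sketch of rule-based matching with spaCy's PhraseMatcher, using a small, hypothetical list of skill terms; in a real pipeline, the search words would come from a formal taxonomy with thousands of labels and synonyms.

```python
# Sketch: rule-based skill matching from a small, hypothetical term list.
# Assumes spaCy; real pipelines would load search words from a formal
# taxonomy rather than a hand-written list.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
skill_terms = ["project management", "sql", "teamwork"]

# Match case-insensitively by comparing lowercased token text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("SKILL", [nlp.make_doc(term) for term in skill_terms])

doc = nlp("We expect strong SQL skills, teamwork and project management experience.")
for _, start, end in matcher(doc):
    print("skill match:", doc[start:end].text)
```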

For a more in-depth discussion on how to build semantic models, look at the section Data Standards for Taxonomies and Ontologies.

On the other hand, rule-based matching has some significant disadvantages: 1) curating rules is time-intensive; 2) rule-based models don't generalize beyond the rules that were set; and 3) rule-based models should be evaluated against the same standards as statistical models - making it necessary, again, to create an annotated gold standard.

Rule-based models can be combined with statistical models to improve performance. Different architectures are possible: You can use a rule-based model and a statistical model simultaneously to optimize for recall. You can also use a broad rule-based model to generate candidates and a statistical model for disambiguation.

Semantic Similarity

Semantic similarity is a method to compare the similarity of two or more texts or documents. It doesn't necessarily represent its own class of approaches, as embeddings can already be used to compare documents. However, there are some models, such as Siamese neural networks, that can be trained for similarity.

This can be helpful for many tasks in OJA analysis, such as identifying duplicates, entity linking, and classification.
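
A short sketch of embedding-based similarity, assuming the sentence-transformers package; the model name is an illustrative multilingual choice and the ad texts are invented.

```python
# Sketch: comparing job-ad snippets with sentence embeddings and cosine
# similarity. Assumes sentence-transformers; the model name is only an
# illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

ads = [
    "Data analyst with strong SQL skills wanted.",
    "We are looking for an analyst experienced in SQL.",
    "Forklift driver needed for our warehouse.",
]
embeddings = model.encode(ads, convert_to_tensor=True)

# Pairwise cosine similarities: near-duplicates score high, unrelated ads low.
print(util.cos_sim(embeddings, embeddings))
```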
