Evaluation and Quality Control

Evaluating enrichment and extraction models.


Model evaluation is a crucial part of developing models for OJA enrichments because it helps you understand how well your model is performing and whether it is meeting the goals that you have set for it. To evaluate machine learning and extraction models, we need a gold standard curated by experts (see Gold Standard Annotation and Quality). Against the gold standard, we can calculate different metrics to evaluate the model performance (see Evaluation Metrics).

Model evaluation for OJA analysis is often more complex than it is for other use cases. This has to do with a) the complexity of the tasks, b) the ambiguity often found in OJA taxonomies, and c) the number of classes we want to predict.

While there are some standard metrics like precision and recall, they often don't tell the whole story and are, on their own, not enough to interpret and judge the performance of a model.

Evaluation Metrics

Which evaluation metrics you use depends heavily on the data, the purpose, and the context of your models. In general, the first step is to calculate a confusion matrix where predicted values and actual values are compared.

The following is an example of a confusion matrix for a classification algorithm that solves the task of classifying zones in an OJA. Each cell is the number of examples for a specific actual vs. predicted value.
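As a minimal sketch of how such a table is built, the snippet below counts actual vs. predicted labels in plain Python. The zone labels and the eight example predictions are invented for illustration and are not taken from a real OJA model.

```python
from collections import Counter

# Hypothetical zone labels, invented for illustration only.
actual = ["tasks", "tasks", "skills", "skills", "skills", "benefits", "benefits", "tasks"]
predicted = ["tasks", "skills", "skills", "skills", "tasks", "benefits", "skills", "tasks"]


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


labels, matrix = confusion_matrix(actual, predicted)
print("columns (predicted):", labels)
for label, row in zip(labels, matrix):
    print(f"actual {label}: {row}")
```

The diagonal holds the correct predictions; every off-diagonal cell shows which class was confused with which.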

From the confusion matrix, you would usually derive the number of True Positives, False Positives, False Negatives, and True Negatives, which can then be used to calculate the following metrics:

  1. Accuracy: This is the most straightforward metric and measures the proportion of correct predictions. However, it can be misleading when the distribution of classes is not balanced.

  2. Precision: This measures how many of the positive predictions are correct, so it drops as the number of false positives grows. It is defined as the number of true positives divided by the sum of the true positives and false positives. Intuitively, it answers the question "How many cases that were found were actually correct?"

  3. Recall: This measures how many of the actual positive cases are found, so it drops as the number of false negatives grows. It is defined as the number of true positives divided by the sum of the true positives and false negatives. Intuitively, it answers the question "How many cases that actually exist were correctly found?"

  4. F1 score (or F-Score): This is the harmonic mean of precision and recall and is defined as 2 * (precision * recall) / (precision + recall).

  5. Confusion matrix: This table is used to visualize the performance of a classifier by showing the number of true positives, true negatives, false positives, and false negatives.

  6. ROC curve: This is a plot that shows the true positive rate on the y-axis and the false positive rate on the x-axis and is used to visualize the trade-off between these two rates for different classification thresholds.
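The definitions of accuracy, precision, recall, and F1 above can be sketched in a few lines of Python. The counts in the example are invented and describe a hypothetical rare class (60 positives in 1,000 examples); the function name `binary_metrics` is an illustrative choice, not part of any library.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts for an imbalanced task: only 60 of 1,000 examples are positive.
m = binary_metrics(tp=40, fp=10, fn=20, tn=930)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.97 although a third of the positive cases are missed (recall of about 0.67), which illustrates why accuracy alone can be misleading for imbalanced classes.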

Usually, you would expect that precision and recall are reported. When these metrics are used, it is essential to distinguish between micro- and macro-scores. The micro-score pools all predictions and tells you the performance over all of them, while the macro-score is the unweighted average of the scores calculated for each class or label, so rare classes count as much as frequent ones. Whenever possible, both scores should be calculated and reported.
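The difference between the two averages can be shown with a small sketch. It assumes a single-label classification task; the labels "common" and "rare" and the helper names are invented for illustration.

```python
def per_class_counts(actual, predicted):
    """Collect true positives, false positives, and false negatives per label."""
    labels = sorted(set(actual) | set(predicted))
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for a, p in zip(actual, predicted):
        if a == p:
            counts[a]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # p was predicted but is wrong
            counts[a]["fn"] += 1  # a was the true label but was missed
    return counts


def f1(tp, fp, fn):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(actual, predicted):
    counts = per_class_counts(actual, predicted)
    # Micro: pool the counts of all classes, then compute one score.
    micro = f1(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    # Macro: compute one score per class, then take the unweighted mean.
    macro = sum(f1(**c) for c in counts.values()) / len(counts)
    return micro, macro


# Invented example: the model never predicts the rare class.
actual = ["common"] * 9 + ["rare"]
predicted = ["common"] * 10
micro, macro = micro_macro_f1(actual, predicted)
print(f"micro-F1: {micro:.3f}, macro-F1: {macro:.3f}")
```

Here the micro-F1 is 0.9 while the macro-F1 is below 0.5, because the macro average exposes that the rare class is never found. For taxonomies with many infrequent classes, this gap is often the more informative number.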

It is hard to specify cut-off points for these metrics in order to judge a model. What constitutes a good result depends on many different factors, most importantly the complexity of the task. Some tasks might be semantically easier to model, e.g., finding job titles in a job posting, and you would expect high scores on most metrics (for example, an F1 score of 0.8 or higher). On the other hand, results for more complex tasks, e.g., classifying skills from an 8,000-concept-strong taxonomy, might be considered good if the macro-F1 score is above 0.4.

Furthermore, it is difficult to compare different model implementations that are tested on different gold standard data sets. Therefore, it can be argued that it is the responsibility of the authors of a study or model to critically discuss the performance of the models using benchmarks, examples, or comparison values.

Reporting Results

Reporting results - especially for machine learning models - is an important part of the analysis cycle. It is important to communicate the following points so that others can better interpret and use the results:

  • Dataset: Provide information about the dataset used. This should include details about its size, nature, and source. Highlight if any pre-processing steps (like cleaning, normalization, feature extraction, etc.) were undertaken.

  • Taxonomy: Explain which taxonomy was used and why. If the taxonomy was enriched, indicate which assumptions were applied and how they influence the results. See more about data standards and metrics in the section Data Standards for Taxonomies and Ontologies.

  • Methods: Explain the methodology adopted. Describe the model used, its configuration, and why it was chosen over other possible models. Include details about the algorithm, and the training and validation process.

  • Performance Metrics: Present and explain the metrics used to evaluate the model's performance (accuracy, precision, recall, F1-score, etc.). Detail how these metrics were calculated, why they were chosen, and what they indicate about the model's performance. An overview of metrics can be found in the section Evaluation Metrics above.

  • Results: Present the results obtained from the model. This can include predictive accuracy, feature importance, confusion matrices, or other relevant findings. Make sure to clarify what these results imply in the context of the study.

  • Limitations: Discuss the limitations of your study. This might encompass constraints with the data, model, computational resources, or the inherent complexity of the problem.

  • Bias: Address any potential biases present in the data or the model. This could be in data collection, data representation, or the model's bias-variance trade-off. Acknowledging bias helps in assessing the fairness and generalizability of the model.

A helpful tool for reporting models is the so-called model card. Hugging Face provides a good overview of the literature on model cards and related documentation tools: The presented tools and processes can be "contextualized with regard to their focus (e.g., on which part of the ML system lifecycle does the tool focus?) and their intended audiences (e.g., who is the tool designed for?)".

An example of a model card is the documentation of the extraction algorithm for activity fields ("Teilqualifikationen") by the Bertelsmann Stiftung and &effect data solutions (Müller 2023).

Gold Standard Annotation and Quality

A gold standard is a data set of documents for which human annotators have added labels. This kind of dataset is usually used to train or fine-tune machine learning models and evaluate them.

The development of a gold standard dataset starts with a taxonomy and documents. Depending on the model purpose, the documents are annotated by experts or trained individuals who assign labels to the documents - on the document level or the token level.

To ensure high data quality, annotation guidelines (or coding manuals) are formulated, which help annotators decide on ambiguous cases and explain the concepts to be annotated in more detail.

Developing a gold standard is an iterative process. Usually, it is a good idea to introduce the concepts to all annotators first. Then you can start by assigning a limited number of documents to all annotators and measure the agreement between them, for example with Fleiss' Kappa. Once all annotators are comfortable with the guidelines and the agreement is high, you can split the remaining documents among the annotators and keep a document overlap, for example, 20 percent, to assess the quality continuously.

Different metrics can be used to measure the annotation quality, such as Fleiss' Kappa or Krippendorff's Alpha (McHugh 2012). These metrics should be used continuously to assess whether the annotation guidelines are clear enough or which cases produce the most annotation disagreements.
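As a minimal sketch, Fleiss' Kappa can be computed from a table that records, for each document, how many annotators chose each category. The sketch assumes that every document was labeled by the same number of annotators; the ratings are invented (three annotators, four documents, two categories).

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a table of per-document category counts.

    ratings[i][j] is the number of annotators who assigned category j
    to document i. Every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each document needs the same number (>= 2) of annotators")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per document, averaged over all documents.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_i) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_j)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: 4 documents, 3 annotators, 2 categories.
ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(ratings)
print(f"Fleiss' Kappa: {kappa:.3f}")
```

Recomputing the value after each annotation round, and inspecting the documents with the lowest per-document agreement, shows which parts of the guidelines need clarification.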

As with the model evaluation metrics, the quality of an annotated dataset cannot be judged by annotation agreement metrics alone. Frequently used cut-off points for high agreement scores were formulated by Landis and Koch (1977): .61 to .80 indicates substantial agreement and .81 to 1.00 almost perfect agreement.
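For reporting, an agreement score can be mapped to these verbal labels with a small helper. The sketch below uses the full Landis and Koch (1977) scale; the bands below .61 are not mentioned in the text above and are added here from the original scale. The function name is an illustrative choice.

```python
def landis_koch_label(kappa):
    """Map an agreement score to the verbal labels of Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("agreement scores cannot exceed 1.0")


for score in (0.35, 0.625, 0.9):
    print(score, "->", landis_koch_label(score))
```

The labels are a convention rather than a statistical test, so they should be reported together with the score itself and a discussion of the cases that caused disagreement.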

After the annotation of a gold standard is finished, you will usually have some documents with disagreements between annotators. Depending on the size of your dataset, you can either discard the co-annotated documents or adjudicate the dataset. Adjudication is time-intensive but leads to an unambiguous and high-quality gold standard.

There is no silver bullet for calculating how many documents are needed for a gold standard. Among other considerations, it depends on:

1) how semantically complex the task is: A skill can be expressed in many different ways in a job posting; whether or not you need a university degree for a position is linguistically far less complex to detect.

2) how many classes need to be predicted: A binary classifier (e.g. remote job or not) is far less complex to model and evaluate than a classifier with over 1000 categories (e.g., a skill taxonomy).

3) how classes are distributed: Often classes or categories are not distributed evenly in job postings. If a concept (e.g., a skill) appears only in 1 out of 1000 documents, you would need a large multiple of annotated documents to robustly estimate how well your model can extract it.
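The effect of class distribution can be made concrete with some simple arithmetic. The sketch below is an illustration under stated assumptions, not a sample size formula: it assumes documents are sampled at random, uses an invented target of 50 positive examples, and uses the Wilson score interval (a standard confidence interval for proportions) to show how uncertain a recall estimate is when only a few positive examples are available.

```python
import math


def documents_needed(documents_per_occurrence, min_positive_examples):
    """Documents to annotate so that, on average, enough positives appear.

    documents_per_occurrence=1000 means the concept appears in 1 of 1000 documents.
    """
    return documents_per_occurrence * min_positive_examples


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for roughly 95 percent)."""
    p = successes / trials
    denominator = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
        / denominator
    )
    return center - half_width, center + half_width


# A concept that appears in 1 of 1000 documents, with a target of 50 positives.
n_docs = documents_needed(documents_per_occurrence=1000, min_positive_examples=50)
print(f"documents to annotate: {n_docs}")

# The same observed recall of 0.8, estimated from 5 vs. 50 positive examples.
for found, positives in ((4, 5), (40, 50)):
    low, high = wilson_interval(found, positives)
    print(f"{found}/{positives} found: recall between {low:.2f} and {high:.2f}")
```

With only five positive examples, the interval around the recall estimate is very wide, which is why rare concepts require a large multiple of annotated documents or a targeted sampling strategy.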
