Evaluation and Quality Control
Evaluating enrichment and extraction models.
Model evaluation is a crucial part of developing models for OJA enrichments because it helps you understand how well a model performs and whether it meets the goals you have set for it. To evaluate machine learning and extraction models, we need an expert-curated gold standard (see Gold Standard Annotation and Quality). Against this gold standard, we can calculate different metrics to evaluate model performance (see Evaluation Metrics).
Which evaluation metrics you use depends heavily on the data, the purpose, and the context of your models. In general, the first step is to calculate a confusion matrix where predicted values and actual values are compared.
As an example, consider a confusion matrix for a classification algorithm that solves the task of classifying zones in an OJA. Each cell contains the number of examples with a specific combination of actual and predicted value.
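As a minimal sketch of how such a matrix can be computed, the example below uses scikit-learn; the zone labels and the toy predictions are hypothetical and only illustrate the mechanics.

```python
# Minimal sketch: confusion matrix for a hypothetical OJA zone classifier.
# The zone labels and example data are invented for illustration.
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["company", "tasks", "requirements", "benefits"]  # hypothetical zone taxonomy

y_true = ["company", "tasks", "tasks", "requirements", "benefits", "requirements"]
y_pred = ["company", "tasks", "requirements", "requirements", "benefits", "tasks"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are actual zones, columns are predicted zones.
print(pd.DataFrame(cm, index=labels, columns=labels))
```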
From the confusion matrix, you would usually derive the number of True Positives, False Positives, and False Negatives, which can then be used to calculate the following metrics (a short computation sketch follows the list):
Accuracy: This is the most straightforward metric and measures the proportion of correct predictions. However, it can be misleading when the distribution of classes is not balanced.
Precision: This reflects how well the model avoids false positives and is defined as the number of true positives divided by the sum of the true positives and false positives (TP / (TP + FP)). Intuitively, it answers the question "How many cases that were found were actually correct?"
Recall: This reflects how well the model avoids false negatives and is defined as the number of true positives divided by the sum of the true positives and false negatives (TP / (TP + FN)). Intuitively, it answers the question "How many cases that actually exist were correctly found?"
F1 score (or F-Score): This is the harmonic mean of precision and recall and is defined as 2 * (precision * recall) / (precision + recall).
Confusion matrix: This table is used to visualize the performance of a classifier by showing the number of true positives, true negatives, false positives, and false negatives.
ROC curve: This is a plot that shows the true positive rate on the y-axis and the false positive rate on the x-axis and is used to visualize the trade-off between these two rates for different classification thresholds.
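To make the definitions above concrete, here is a small sketch that derives accuracy, precision, recall, and F1 from the cells of a binary confusion matrix; the counts are invented for illustration.

```python
# Minimal sketch: metrics derived from the cells of a binary confusion matrix.
# The counts are invented for illustration.
tp, fp, fn, tn = 80, 10, 20, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # "How many cases that were found were actually correct?"
recall = tp / (tp + fn)      # "How many cases that actually exist were correctly found?"
f1 = 2 * (precision * recall) / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```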
Usually, you would expect precision and recall to be reported. When these metrics are used, it is essential to distinguish between micro- and macro-scores: the micro-score aggregates the performance over all individual predictions, while the macro-score averages the performance per class or label, giving each class equal weight. Whenever possible, both scores should be calculated and reported.
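The difference between the two averaging strategies can be illustrated with scikit-learn, which exposes both through the average parameter; the toy labels below are hypothetical.

```python
# Minimal sketch: micro- vs. macro-averaged precision, recall, and F1 with scikit-learn.
# The toy labels are hypothetical.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["tasks", "tasks", "requirements", "benefits", "benefits", "benefits"]
y_pred = ["tasks", "requirements", "requirements", "benefits", "benefits", "tasks"]

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```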
Reporting results - especially for machine learning models - is an important part of the analysis cycle. Communicate the following points so that others can better interpret and use the results:
Dataset: Provide information about the dataset used. This should include details about its size, nature, and source. Highlight if any pre-processing steps (like cleaning, normalization, feature extraction, etc.) were undertaken.
Methods: Explain the methodology adopted. Describe the model used, its configuration, and why it was chosen over other possible models. Include details about the algorithm, and the training and validation process.
Performance Metrics: Present and explain the metrics used to evaluate the model's performance (accuracy, precision, recall, F1-score, etc.). Detail how these metrics were calculated, why they were chosen, and what they indicate about the model's performance. An overview of metrics can be found above in Evaluation Metrics.
Results: Present the results obtained from the model. This can include predictive accuracy, feature importance, confusion matrices, or other relevant findings. Make sure to clarify what these results imply in the context of the study.
Limitations: Discuss the limitations of your study. This might encompass constraints with the data, model, computational resources, or the inherent complexity of the problem.
Bias: Address any potential biases present in the data or the model. This could be in data collection, data representation, or the model's bias-variance trade-off. Acknowledging bias helps in assessing the fairness and generalizability of the model.
Taxonomy: Explain which taxonomy was used and why. If the taxonomy was enriched, indicate which assumptions were applied and how they influence the results. See the section on data standards and metrics for more details.
A helpful tool for reporting models is the so-called model card. Huggingface provides a good overview of the available tools and processes, which can be "contextualized with regard to their focus (e.g., on which part of the ML system lifecycle does the tool focus?) and their intended audiences (e.g., who is the tool designed for?)".
An example of a model score card is the one published by the Bertelsmann Stiftung and &effect data solutions (Müller 2023).
A gold standard is a dataset of documents to which human annotators have added labels. Such a dataset is usually used to train or fine-tune machine learning models and to evaluate them.
The development of a gold standard dataset starts with a taxonomy and a set of documents. Depending on the model's purpose, the documents are annotated by experts or trained individuals who assign labels - on the document level or on the token level.
To ensure high data quality, annotation guidelines (or coding manuals) are formulated; they help annotators decide ambiguous cases and explain the concepts to be annotated in more detail.
After annotating a gold standard, you will usually have some documents with disagreements between annotators. Depending on the size of your dataset, you can either discard the co-annotated documents or adjudicate the dataset. Adjudication is time-intensive but leads to an unambiguous, high-quality gold standard.
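As a small illustration of how disagreement cases can be identified for adjudication, the sketch below compares two hypothetical annotators with pandas; the column names and labels are invented.

```python
# Minimal sketch: find documents where two hypothetical annotators disagree,
# so that they can be sent to adjudication. Column names and labels are invented.
import pandas as pd

df = pd.DataFrame({
    "doc_id": [1, 2, 3, 4],
    "annotator_a": ["skill", "task", "other", "skill"],
    "annotator_b": ["skill", "other", "other", "task"],
})

disagreements = df[df["annotator_a"] != df["annotator_b"]]
print(disagreements)  # these documents go into the adjudication round
```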
Different metrics can be used to measure annotation quality, such as Fleiss' Kappa or Krippendorff's Alpha. These metrics should be used continuously to assess whether the annotation guidelines are clear enough and which cases produce the most annotation disagreements.
As with the model evaluation metrics, agreement scores alone do not fully capture the quality of an annotated dataset and need to be interpreted in context. Commonly used cut-off points for high agreement were formulated by Landis and Koch (1977): .61 to .80 counts as substantial agreement and .81 to 1.00 as almost perfect agreement.
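As a sketch of how such an agreement score can be computed, the example below uses the Fleiss' Kappa implementation in statsmodels (assumed to be installed); the annotation matrix of documents by annotators is invented.

```python
# Minimal sketch: Fleiss' Kappa for three hypothetical annotators, assuming
# statsmodels is available. Rows are documents, columns are annotators, and the
# values are the assigned labels (invented for illustration).
import numpy as np
from statsmodels.stats import inter_rater as irr

annotations = np.array([
    ["skill", "skill", "skill"],
    ["skill", "task",  "skill"],
    ["task",  "task",  "task"],
    ["other", "other", "task"],
    ["other", "other", "other"],
])

# Convert the raw label matrix into a documents x categories count table.
counts, categories = irr.aggregate_raters(annotations)
kappa = irr.fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' Kappa: {kappa:.2f}")
```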