
Bachelor's Thesis Markus Löhde


Abstract

The objective of this thesis is to enhance the extrapolation step of the CD4AI project.
CD4AI addresses a major challenge arising from two facts: first, an estimated 80% of generated data is unstructured, with a substantial amount being in text form; second, accurately annotating this data with human domain experts is time-consuming and, consequently, very costly. A system that annotates faster while maintaining accuracy could therefore unlock considerable potential.

At its core, the extrapolation step of CD4AI tackles a multiclass, multilabel text classification task. Depending on the availability of labeled examples, the literature distinguishes between three primary approaches for this type of task: zero-shot classification, weakly supervised classification, and few-shot classification. State-of-the-art methods for each approach include NLI-based text classification with LLMs as proposed by Gera et al. (2022) for zero-shot classification, a hybrid technique combining pseudo-document generation and self-training as suggested by Meng et al. (2018) for weakly supervised classification, and few-shot tuning utilizing Sentence Transformers as introduced by Tunstall et al. (2022) for few-shot classification.
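To make the zero-shot direction concrete, the following sketch uses the Hugging Face transformers zero-shot-classification pipeline, which is NLI-based, in a multilabel setup. The model name, example document, labels, and threshold are illustrative assumptions, not the configuration used in the thesis.

    # Hedged sketch: NLI-based zero-shot multilabel classification with the
    # Hugging Face "zero-shot-classification" pipeline. Model, document,
    # labels, and threshold are illustrative placeholders.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    document = "The gearbox shows increased vibration after 2,000 operating hours."
    candidate_labels = ["maintenance", "quality issue", "logistics"]  # hypothetical classes

    result = classifier(document, candidate_labels, multi_label=True)
    # With multi_label=True each label receives an independent score in [0, 1];
    # a threshold (e.g. 0.5) turns the scores into a multilabel prediction.
    predicted = [label for label, score in zip(result["labels"], result["scores"]) if score > 0.5]
    print(predicted)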

In our case, the prior step in the CD4AI pipeline provides us with context rules for each class that act as representative class archetypes. These context rules do not represent labeled examples; instead, they aim to be more abstract, capturing both the syntactic and semantic essence of the class. Our work therefore aligns most closely with the field of weakly supervised text classification, although context rules represent a new kind of seed knowledge. As a result, none of the current state-of-the-art techniques fit our use case exactly.

We therefore explore five distinct approaches in this thesis, all of which attempt to leverage the context rules to predict the labels of documents. We begin with simple methods that focus on the syntactic similarities between the rules and the documents of the corresponding class. We then progress to methods that focus on semantic similarities by using embedding models such as Sentence Transformers. Finally, we fine-tune the embedding models as proposed by Tunstall et al. (2022) to tailor them specifically to our dataset.
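The semantic-similarity idea can be sketched as follows: context rules and documents are embedded with a Sentence Transformer, and a document receives every label whose rule embedding is sufficiently similar. The model name, example rules, documents, and threshold below are assumptions for illustration only, not the actual pipeline.

    # Hedged sketch: semantic matching of context rules and documents via
    # Sentence Transformer embeddings. Model, texts, and threshold are
    # illustrative assumptions.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # One hypothetical context rule per class.
    context_rules = {
        "maintenance": "component requires inspection or repair after wear",
        "quality": "product deviates from its specification",
    }
    documents = ["The bearing must be replaced due to heavy wear."]

    rule_emb = model.encode(list(context_rules.values()), convert_to_tensor=True)
    doc_emb = model.encode(documents, convert_to_tensor=True)

    # Cosine similarity between every document and every class rule.
    scores = util.cos_sim(doc_emb, rule_emb)  # shape: (num_docs, num_classes)

    labels = list(context_rules.keys())
    for d, doc in enumerate(documents):
        predicted = [labels[c] for c in range(len(labels)) if scores[d][c] > 0.4]
        print(doc, "->", predicted)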

We evaluate these methods based on widely used metrics and compare them to state-of-the-art solutions from the aforementioned research fields. Our findings reveal that embedding models fine-tuned on labeled examples can achieve accuracy comparable to proprietary, state-of-the-art LLMs such as GPT-4, while using only a fraction of the computational resources. However, we also find that our current context rules do not effectively approximate these labeled examples. Based on our findings, we offer recommendations for further research, such as using the context rules as seed knowledge for the method proposed by Meng et al. (2018) to create better approximations of labeled documents.
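For multilabel classification, micro- and macro-averaged precision, recall, and F1 are among the standard metrics. The snippet below shows how such scores can be computed with scikit-learn on binary indicator matrices; the arrays are made-up toy data, not results from the thesis.

    # Hedged sketch: common multilabel metrics with scikit-learn.
    # y_true / y_pred are toy binary indicator matrices (documents x classes).
    import numpy as np
    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0]])
    y_pred = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [1, 0, 1]])

    print("micro F1:", f1_score(y_true, y_pred, average="micro"))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))
    print("micro precision:", precision_score(y_true, y_pred, average="micro"))
    print("micro recall:", recall_score(y_true, y_pred, average="micro"))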

 

Research Questions

  1. What are the state-of-the-art approaches for a multilabel classification of large, domain-specific text corpora?
  2. What is the most efficient and accurate approach for leveraging context-specific class archetypes for a multilabel classification of large, domain-specific text corpora?
  3. How can the efficiency and accuracy of a system designed to annotate large, domain-specific text corpora be evaluated?

 

Files and Subpages

Name                                         Size     Last Modification
240304 Markus Löhde Kick-Off.pdf             1,75 MB  03.06.2024
240515_Markus_Löhde_Bachelor_Thesis.pdf      1,79 MB  03.06.2024
240527 Markus Löhde Final Presentation.pdf   2,05 MB  03.06.2024