
Master's Thesis Maria Nakhla

Last modified May 9

Abstract

The proliferation of unstructured text data has created an urgent need for efficient and effective methods to transform these data into structured, annotated datasets for AI applications. This general objective involves a data distillation step, which, in this study, focuses on condensing semantically related context windows into coherent archetypes. The generated archetypes, each representing a domain or subdomain, are subsequently used to classify text contexts instead of relying on millions of general-domain contexts, which may contain irrelevant or duplicated information.

The proposed approach involves collecting datasets from crucial domains, including Technology, Business, Sports, Politics, and Entertainment. First, context windows are extracted for each domain using a previously implemented pipeline. Then, a recursive hierarchical clustering approach is employed to group relevant subcontexts within each domain. This clustering step paves the way for experimenting with various large language models (LLMs) to generate an archetype per cluster. Prompt engineering techniques are explored to refine the retrieval of high-quality archetypal LLM outputs, with numerous iterations ensuring that the LLM-generated outputs adhere to the desired format. Edge cases are handled carefully to ensure proper parsing. By the end of this process, each cluster maps to a single archetypal rule, and the combination of all clusters’ archetypes results in a comprehensive archetype set for a specific domain. 
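The recursive hierarchical clustering step described above can be sketched in miniature: embeddings of context windows are repeatedly bisected (here with a simple 2-means split) until every cluster is small enough to summarize with a single archetype. Everything in this sketch — the toy embeddings, the size threshold, and the 2-means splitter — is an illustrative assumption, not the thesis pipeline itself.

```python
import numpy as np

def split_recursive(emb, idx, max_size, rng, depth=0, max_depth=10):
    """Recursively bisect embedding vectors with 2-means until every
    cluster holds at most max_size items (or the depth limit is hit)."""
    if len(idx) <= max_size or depth >= max_depth:
        return [idx]
    X = emb[idx]
    # initialize two centroids from distinct random points
    centroids = X[rng.choice(len(X), 2, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(20):  # a few Lloyd iterations suffice for a sketch
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(axis=0)
    if (labels == 0).all() or (labels == 1).all():
        return [idx]  # degenerate split: stop recursing
    clusters = []
    for k in (0, 1):
        clusters += split_recursive(emb, idx[labels == k], max_size, rng,
                                    depth + 1, max_depth)
    return clusters

rng = np.random.default_rng(0)
# toy "context window" embeddings: two well-separated groups of 20 points
emb = np.vstack([rng.normal(0.0, 0.1, (20, 8)),
                 rng.normal(5.0, 0.1, (20, 8))])
clusters = split_recursive(emb, np.arange(len(emb)), max_size=25, rng=rng)
```

In such a scheme, each leaf cluster would then be handed to an LLM prompt to produce one archetype, so the size threshold effectively bounds how much text the model must distill at once.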

Various evaluation techniques were used, including semantic search, fine-tuning text classifiers, conducting surveys, and re-clustering the generated domain archetypes. Together, these methods provide a consistent framework for assessing archetype completeness and disjointness, and for comparing classifiers trained on full-domain text with those trained on domain archetypes.
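One of the listed evaluation techniques, semantic search, can be illustrated as nearest-archetype retrieval: each held-out context is assigned to the archetype whose embedding it is most cosine-similar to. The two-dimensional vectors and the two-domain setup below are made up for illustration and do not come from the thesis data.

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# hypothetical archetype embeddings (one per domain) and held-out contexts
archetypes = np.array([[1.0, 0.0],   # e.g. a "Sports" archetype
                       [0.0, 1.0]])  # e.g. a "Politics" archetype
contexts   = np.array([[0.9, 0.1],
                       [0.2, 0.8]])

sims = cosine_matrix(contexts, archetypes)
predicted = sims.argmax(axis=1)  # nearest archetype per context
```

Comparing such predictions against gold domain labels yields the classification-style metrics that the archetype sets are judged on.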

The results of this study demonstrate that classifiers trained on domain archetypes outperform those trained on full text. Furthermore, although the Meta-Llama-3-8B-Instruct model is considered a lightweight LLM, it surpasses larger models in contextual knowledge distillation and archetype generation. Limitations include length bias in human evaluations, inconsistencies between subjective human ratings and classification metrics, and the dataset's scope: survey participants often favored longer generated archetypes, possibly conflating verbosity with quality, and the results may not generalize beyond coarse-grained news domains. Future work should target more nuanced, multilingual corpora.

Overall, this research contributes to the development and evaluation of an automated, domain-expert-driven approach to archetype creation. This method accurately captures domain-specific knowledge and enhances the quality of subsequent text classifications. The research successfully fulfills its goal of transforming unstructured text data into structured and annotated datasets, ultimately supporting the creation of more effective AI applications.

 

Research Questions

  1. What are the most effective methods to accurately capture semantics from related text chunks and distill them into one coherent text?

  2. What prompt engineering techniques are crucial for generating well-defined, coherent class archetypes that effectively incorporate domain expertise?

  3. How can the quality and consistency of the generated class archetypes be evaluated and ensured?

  4. Do the generated archetypes achieve downstream results comparable to manually annotated datasets in domain-specific tasks?

 

Files and Subpages

Name Size Last Modification
250113 MT-KickOff-Presentation.pptx 3,99 MB 09.05.2025
250428 MT-Final-Presentation.pptx 6,63 MB 09.05.2025
250501 Maria_Nakhla_Thesis.pdf 3,79 MB 09.05.2025