Master's Thesis Moritz Steigerwald

Last modified May 9

Towards Scalable Domain-Specific Document Annotation: A Semantic Archetype-Driven Framework

Abstract

The emerging demand for NLP applications in specialized, domain-specific contexts, where large annotated corpora are not readily available, stresses the need for frameworks that transform unstructured texts into accessible, annotated datasets using domain expertise rather than manual annotation. This work proposes a semantic archetype-driven approach that leverages rule texts as both domain-guiding definitions and training exemplars for classification. By combining term-matching similarity scores with dense embedding representations, the framework enables a modern, robust classifier to accurately map unstructured text onto high-level semantic categories. The architecture prioritizes interpretability by explicitly aligning classifier predictions with human-defined rules, creating a human-in-the-loop workflow in which experts iteratively refine rule texts and model behavior to capture domain-specific semantics. Although previous efforts have focused on distilling context windows and deriving rules, this work takes the next step by building the complete classification pipeline and evaluating its ability to adapt. In particular, we use a reinforcement learning (RL) approach (e.g., PPO, GRPO) to iteratively refine archetypes and classifier parameters. A reward signal based on both classification confidence and expert validation adjusts archetype relevance thresholds and model behavior, resulting in a closed-loop system that adapts to domain shifts without re-annotation. Experiments in domains with limited annotated data (e.g., regulatory documents, academic preprints) benchmark the framework against zero-shot LLMs and supervised baseline models, focusing on accuracy, computational efficiency, and the ability to adapt to evolving terminology. By blending rule-based interpretability with modern NLP architectures and adaptive RL, this work promotes expert-in-the-loop systems that democratize AI in specialized fields. The framework emphasizes human-centric scalability by empowering domain experts to mine corpora through semantic definitions instead of tedious manual annotation, and it provides a blueprint for transforming unstructured text into structured knowledge, even under resource-constrained conditions.
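To make the scoring idea in the abstract concrete, the following is a minimal sketch of how a term-matching score and an embedding similarity could be blended to assign a text to an archetype. It is not the thesis implementation: the function names, the mixing weight `alpha`, the `threshold`, and the example rule texts are all hypothetical, and a bag-of-words count vector stands in for a real dense embedding model so that the example runs with the standard library alone.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def term_overlap(text, rule_text):
    """Jaccard overlap of token sets (a stand-in for a term-matching score)."""
    a, b = set(tokenize(text)), set(tokenize(rule_text))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bow_vector(text):
    """Placeholder 'embedding': sparse token counts.
    Assumption: a real system would call a dense sentence encoder here."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(text, rule_text, embed=bow_vector, alpha=0.5):
    """Blend lexical and embedding similarity; alpha is a hypothetical weight."""
    lexical = term_overlap(text, rule_text)
    semantic = cosine(embed(text), embed(rule_text))
    return alpha * lexical + (1 - alpha) * semantic


def classify(text, archetypes, threshold=0.2):
    """Return (best archetype or None, all scores).
    The relevance threshold is an illustrative value, not a tuned one."""
    scores = {name: hybrid_score(text, rule) for name, rule in archetypes.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None), scores


# Hypothetical rule texts for two archetypes in a regulatory domain.
archetypes = {
    "obligation": "the operator must submit a report to the authority",
    "definition": "the term means a document as defined in this regulation",
}
label, scores = classify("The operator must submit an annual report.", archetypes)
```

Because every score traces back to a named rule text, an expert can inspect why a passage was assigned to an archetype and edit the rule text directly, which is the interpretability property the abstract emphasizes.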

Research Questions

How can semantic archetypes based on context rules be leveraged to classify and annotate domain-specific documents?
Can a reward-based feedback system boost both classification accuracy and archetype quality?
How does this archetype-driven framework compare to supervised models and zero-shot LLMs in terms of accuracy and resource utilization across different domains?
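As an illustration of the feedback loop behind the second research question, the sketch below shows one simple way a reward could combine classification confidence with expert validation and then nudge an archetype's relevance threshold. This is a plain heuristic update, not PPO or GRPO, and every name, weight, and step size in it is an assumption made for the example.

```python
def reward(confidence, expert_agrees, w_conf=0.5, w_expert=0.5):
    """Combine classifier confidence (0..1) with a binary expert verdict.
    The weights are hypothetical; expert disagreement contributes -1."""
    expert_term = 1.0 if expert_agrees else -1.0
    return w_conf * confidence + w_expert * expert_term


def update_threshold(threshold, reward_value, lr=0.05, lo=0.0, hi=1.0):
    """Heuristic update: a positive reward lowers the archetype's relevance
    threshold (it is trusted more); a negative reward raises it.
    The result is clamped to [lo, hi]."""
    new_threshold = threshold - lr * reward_value
    return min(hi, max(lo, new_threshold))


# One feedback step for a single archetype (illustrative numbers).
r = reward(confidence=0.8, expert_agrees=True)
t = update_threshold(0.5, r)
```

A policy-gradient method would replace `update_threshold` with a learned update, but the reward composition could remain the same.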
