Automated employment contract review is among the many promising applications of Large Language Models (LLMs) in the legal domain. Employment contracts are designed to protect both employers and employees, yet evolving legislation, standardized templates, and individual modifications can introduce legally void clauses. Undetected, these clauses pose financial and reputational risks for employers and may weaken employees' contractual protections. As a result, contract reviews are often outsourced to law firms, a process that is both costly and time-consuming. An automated, LLM-driven fairness and legality assessment of German employment contracts thus has the potential to streamline reviews, reduce costs, and enhance compliance, ultimately strengthening employees' legal standing. However, accurately classifying the fairness of contractual clauses requires expert knowledge of labor law, legal precedents, and court rulings, expertise that off-the-shelf LLMs typically lack. To bridge this knowledge gap, Retrieval-Augmented Generation (RAG) or Fine-Tuning can be employed. Yet RAG depends on high-quality, consistent, and up-to-date legal data, which is often unavailable, while Fine-Tuning is resource-intensive and technically complex, positioning Prompt Engineering as a more accessible and scalable approach. Manual prompt engineering, however, is labor-intensive and time-consuming, which is why Automated Prompt Optimization (APO) has emerged as a promising solution. Even so, existing APO applications have primarily focused on enhancing instruction-following behavior, leaving open the question of whether APO can implicitly acquire the legal knowledge necessary for fairness classification of clauses in German employment contracts. To address this gap, we develop a novel APO algorithm based on beam search, combining global exploration and local exploitation to optimize category-specific prompts.
We evaluate the performance of APO-optimized prompts against a domain-expert-written baseline on a dataset of German employment contract clauses. Our results show that APO yields only a marginal 2-percentage-point increase in global F1 but improves the detection of void clauses by 45%, a significant success given their legal and financial implications. We hypothesize that the limited performance gains stem from algorithmic overfitting, dataset size constraints, class imbalance, and potential mismatches between LLM scale and task complexity. Our findings highlight key challenges in applying APO to downstream tasks in knowledge-intensive domains such as law and medicine.
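The beam-search idea behind the APO algorithm can be illustrated in a few lines: keep a small beam of the best-scoring prompts (local exploitation), generate fresh variants of each (global exploration), and re-rank. The sketch below is a minimal, hypothetical illustration, not the thesis's actual implementation; `toy_mutate` and `toy_score` stand in for the LLM-backed mutation and dev-set evaluation steps, and all names and keyword heuristics are invented for demonstration.

```python
import random

def optimize_prompt(seed_prompt, mutate, score,
                    beam_width=4, candidates_per_prompt=3, iterations=5):
    """Beam-search APO sketch: expand each prompt in the beam into several
    candidates (exploration), then keep only the top-scoring ones
    (exploitation) for the next round."""
    beam = [seed_prompt]
    for _ in range(iterations):
        pool = set(beam)  # keep current survivors so quality never regresses
        for prompt in beam:
            for _ in range(candidates_per_prompt):
                pool.add(mutate(prompt))
        beam = sorted(pool, key=score, reverse=True)[:beam_width]
    return beam[0]

# Toy stand-ins for the LLM-backed components (purely hypothetical):
WORDS = ["fair", "void", "clause", "check", "law", "labor"]

def toy_mutate(prompt):
    # A real system would ask an LLM to rewrite the prompt.
    return prompt + " " + random.choice(WORDS)

def toy_score(prompt):
    # A real system would measure classification F1 on a dev set;
    # here we pretend prompts mentioning legal keywords score higher.
    return sum(prompt.count(w) for w in ("void", "law", "labor"))

random.seed(0)
best = optimize_prompt("Classify the clause:", toy_mutate, toy_score)
```

In a real setting, `score` would be the expensive step (one LLM evaluation pass per candidate over a labeled dev set), which is why the beam width and candidate count directly control the optimization budget.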
1. What is the current state of Automatic Prompt Optimization, and to what extent do existing approaches consider expert knowledge in prompt optimization?
2. How can existing methods be adapted to design an APO algorithm that optimizes prompts used for classifying clauses from German employment contracts by implicitly learning the required legal expert knowledge?
3. How does this new APO algorithm perform in classifying the fairness of clauses from German employment contracts compared to an unoptimized expert-written prompt as a benchmark?
Name | Type | Size | Last Modification | Last Editor
---|---|---|---|---
241007 Kickoff MT David Pauschert.pdf | | 1,03 MB | 09.05.2025 |
250304 Final Presentation MT David Pauschert .pdf | | 1,70 MB | 09.05.2025 |
Masters_Thesis_David_Pauschert.pdf | | 3,10 MB | 09.05.2025 |