Sentence Boundary Detection in German Legal Documents

Last modified Nov 15

Documents from the German legal system pose challenges to standard NLP techniques for sentence boundary detection (SBD). This applies to the vast majority of document types, such as laws and judgments. Decision makers frequently employ long sentences, complex sentence structures, quotations, citations, and extensive use of parentheses. Citations and lists introduce ambiguity into the meaning of punctuation: periods and colons that serve other functions complicate the decision of whether a sentence has ended (Savelka et al., 2017). Researchers have noted that lists, with their use of colons and periods in enumerations and citations, and their combinations of punctuation and alphanumeric characters, make it harder to tokenize (regulatory) texts and split them into sentences [3]. De Maat and Winkels (2009) observed that lists degraded the performance of their sentence classifier [1]. Savelka et al. have already shown that existing SBD systems perform considerably worse on legal decisions than on non-legal texts [2]. These systems rest on a number of assumptions that do not hold for legal texts, which impairs their performance.
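To make the ambiguity concrete, here is a minimal sketch of how a naive, rule-based splitter over-segments a German legal citation. The regex heuristic and the example sentence are illustrative assumptions, not part of any existing SBD system:

```python
import re

# Naive heuristic: a sentence ends at ., !, or ? followed by whitespace
# and an uppercase letter. German legal citations break this rule.
NAIVE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-ZÄÖÜ])')

text = ("Vgl. BGH, Urteil vom 12. Mai 2020, Az. VI ZR 252/19. "
        "Die Revision wurde zurückgewiesen.")

sentences = NAIVE_BOUNDARY.split(text)
# The periods in "Vgl.", "12.", "Az.", and the citation terminate nothing,
# yet the heuristic splits at each of them: 5 fragments instead of 2.
print(sentences)
```

A domain-aware SBD system would need to recognize that the periods in abbreviations and file numbers ("Aktenzeichen") do not mark sentence boundaries.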

SBD is a critical task in many applications such as summarization, machine translation, information extraction, and information retrieval. Problems in automatically segmenting legal texts into sentences therefore presumably affect any text processing pipeline applied to them: errors in SBD propagate through higher-level tasks and lower overall performance. SBD errors are particularly problematic for semantic processing of legal texts that focuses on identifying the inferential roles sentences play, such as stating legal rules, findings of fact, or a court's conclusion of law. Suboptimal SBD will likely degrade these ultimate applications.

For that reason, the goal of this work is to implement an SBD system for the German language, tailored to the legal domain. The initial task consists of assessing and selecting suitable models (sequence labelling, maximum entropy, etc.). Afterwards, a dataset needs to be created: the chair provides the legal texts, and the annotation will be done in a joint effort. After training the different models, their performance shall be evaluated against a baseline of existing, domain-agnostic SBD systems.
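The evaluation step described above could score predicted sentence boundaries against gold annotations. A minimal sketch, assuming boundaries are represented as character offsets of sentence ends (the representation and the example offsets are illustrative, not prescribed by the project):

```python
def boundary_prf(gold, pred):
    """Precision, recall, and F1 over sets of sentence-end offsets."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # boundaries the system got right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A system that finds both true boundaries but adds one spurious
# split (e.g. after an abbreviation at offset 4):
p, r, f = boundary_prf(gold={52, 120}, pred={4, 52, 120})
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Treating boundaries as a set of offsets makes the comparison against a baseline straightforward: both systems are run on the same texts and scored against the same gold annotations.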

 

The following prior knowledge is necessary:
- Experience in software development with Java and Python
- Experience in Machine Learning
- Either experience in NLP or the willingness to dedicate time to familiarizing yourself with it
- Preferably (but not necessarily) knowledge of the German language

If you are interested, please contact Ingo Glaser with your application, including a letter of motivation, your current CV, and a transcript of records.

 

Literature:

[1] de Maat, E., & Winkels, R. (2009). A next step towards automated modelling of sources of law. In Proceedings of the 12th International Conference on AI and Law (pp. 31-39). ACM.

[2] Savelka, J., Walker, V. R., Grabmair, M., & Ashley, K. D. (2017). Sentence Boundary Detection in Adjudicatory Decisions in the United States. Revue TAL, 58(2), 21-45.

[3] Wyner, A., & Peters, W. (2011). On Rule Extraction from Regulations. In JURIX (Vol. 11, pp. 113-122).
