Bachelor's Thesis Tom Schamberger

Last modified Dec 17, 2019

In the legal domain, documents such as court decisions and contracts are regularly anonymized. This requires identifying and neutralizing highly sensitive text sequences so that the information they contain is protected from third parties. The work is usually performed manually by trained employees, and anonymization is therefore generally considered expensive and inefficient.

This thesis proposes a machine learning approach for automatically identifying sensitive text elements in German court decisions and provides an implementation. For this task, different deep neural network architectures are evaluated, based on general-purpose pre-trained contextual embeddings as well as on trained word embeddings. Because non-anonymized data sets are lacking, the models are trained solely on anonymized data. To overcome this limitation, the architectures are designed so that the classification of a text element depends mainly on its surrounding context, an approach referred to as "contextual sensitivity classification". Furthermore, a rule-based algorithm has been developed to label anonymization placeholders in anonymized legal documents. Finally, the models and algorithms are evaluated on manually rewritten legal documents.
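
To make the two ideas in this abstract concrete, here is a minimal Python sketch. It is not the thesis's algorithm: the placeholder patterns (`[...]`, bare ellipses, runs of `X`, single capital initials such as `A.`), the label names `SENSITIVE` and `O`, and the helper names `label_placeholders` and `context_window` are all assumptions made for illustration. The first function labels placeholder tokens with simple rules. The second shows one possible way to force a classifier to rely on context, by masking the target token. The thesis achieves this through architecture design, which may work differently.

```python
import re

# Hypothetical placeholder patterns (assumption, not taken from the thesis):
# "[...]", "...", "…", runs of X such as "XXX", and single initials such as "A."
TOKEN = re.compile(
    r"(?P<ph>\[\.\.\.\]|\.\.\.|…|\b[Xx]{2,}\b|\b[A-Z]\.(?!\w))"
    r"|(?P<word>\w+)"
    r"|(?P<punct>[^\w\s])"
)


def label_placeholders(text):
    """Tokenize text and tag each token as a placeholder or as ordinary text."""
    return [
        (m.group(), "SENSITIVE" if m.lastgroup == "ph" else "O")
        for m in TOKEN.finditer(text)
    ]


def context_window(tokens, index, size=2, mask="[MASK]"):
    """Return the tokens around position index, with the target itself masked."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + [mask] + right


labeled = label_placeholders("Der Kläger A. wohnt in XXX.")
tokens = [tok for tok, _ in labeled]
window = context_window(tokens, 2)
```

In this sketch, `A.` and `XXX` receive the label `SENSITIVE`, and the window around `A.` contains only its neighbours plus the mask. Rules this simple will also produce false positives, for example a sentence that legitimately ends in a single capital letter, so a real labeler would need document-specific rules.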

Files and Subpages

Name	Type	Size	Last Modification
190618 Kickoff_Schamberger_Anonymization_of_German_Legal_Texts.pdf	File	3,81 MB	12.11.2019
191103 Final_Schamberger_Anonymization_of_German_Legal_Texts.pdf	File	4,23 MB	12.11.2019
191107 Bachelor's Thesis Tom_Schamberger.pdf	File	572 KB	12.11.2019