Master Thesis Akshaya Muralidharan

Last modified by Stefanie Urchs (account disabled) Oct 22, 2021

Automatic English Text Simplification for Statutes

Automatic text simplification makes text more accessible and easier to interpret for the general public, which is useful in almost every domain. It is especially important in the legal domain, where government regulations are often written in hard-to-understand language and where simpler language has been shown to make justice more accessible [1]. Text simplification is a low-resource task: very few aligned datasets are available for training machine learning models. The existing datasets have either been collected manually [2] or built by applying text alignment techniques to Wikipedia articles [3]. Several providers, such as Capito [4], offer paid conversion of legal text into a simplified form, but there are hardly any publicly available AI-assisted services that do this in an automated or semi-automated manner. Although no corpora of aligned complex and simplified legal texts exist, many legal articles written in simplified language are available online and could be used to build an aligned corpus with text alignment techniques [5][6][7].
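To illustrate what a text alignment technique does, here is a minimal sketch that pairs each simple sentence with its most similar original sentence using TF-IDF cosine similarity. This is a generic baseline chosen for illustration, not the method of [5], [6] or [7] and not necessarily the approach taken in the thesis. The function names, the example sentences and the threshold value are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence."""
    token_lists = [tokenize(s) for s in sentences]
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    n = len(sentences)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Smoothed IDF keeps the weight positive even for very common words.
        vectors.append({
            tok: cnt * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align(simple_sentences, complex_sentences, threshold=0.2):
    """Return (simple_idx, complex_idx, score) for each simple sentence
    whose best-matching original sentence reaches the threshold.
    The threshold is an illustrative assumption and would need tuning."""
    vectors = tfidf_vectors(simple_sentences + complex_sentences)
    simple_vecs = vectors[:len(simple_sentences)]
    complex_vecs = vectors[len(simple_sentences):]
    pairs = []
    for i, s_vec in enumerate(simple_vecs):
        scores = [cosine(s_vec, c_vec) for c_vec in complex_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Hypothetical example sentences, not taken from any real statute.
simple = [
    "You must pay the fine within 30 days.",
    "A tenant can end the lease in writing.",
]
original = [
    "The lessee may terminate the lease agreement by written notice.",
    "Any fine imposed shall be paid within thirty days of the notice.",
]
for i, j, score in align(simple, original, threshold=0.1):
    print(f"{simple[i]!r} <-> {original[j]!r} (score {score:.2f})")
```

A purely lexical measure like this misses paraphrases that share no vocabulary, which is one reason the context-sensitive and word-similarity-based methods cited above are worth surveying.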

The aim of the thesis is to create an aligned dataset for the simplification of legal texts. The thesis involves the following tasks:

Find a source of legal texts that is simple to read
Evaluate the readability of the text
Scrape the text and store it in a structured format
Find the original laws and acts that correspond to the easy text
Scrape and store the original laws in a structured format
Survey text alignment techniques
Align the simple legal text and original laws on a document/article level
Align the simple legal text and original laws on a paragraph/sentence level
Survey transformer-based pre-trained models suitable for fine-tuning
Measure and compare complexity of collected texts
Evaluate the performance of existing text simplification models on the collected legal text
Thesis write-up
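
The readability and complexity tasks above could start from a standard formula such as Flesch Reading Ease, where higher scores indicate easier text. The following is a minimal sketch under that assumption. The task list does not name a metric, so the choice of formula is illustrative, and the syllable counter is a rough vowel-group heuristic rather than a dictionary lookup.

```python
import re


def count_syllables(word):
    """Approximate syllables by counting vowel groups (heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings like '-le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Hypothetical example texts, not taken from any real statute.
simple_text = "You must pay the fine. You have ten days."
legal_text = ("The lessee shall remit the aforementioned pecuniary penalty "
              "notwithstanding any contractual stipulation.")
print(round(flesch_reading_ease(simple_text), 1))
print(round(flesch_reading_ease(legal_text), 1))
```

Scoring both the simple texts and the original laws with the same function gives a first, coarse comparison of their complexity. Legal text with long enumerations or unusual punctuation would likely need a more careful sentence splitter than the regular expression used here.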

References:

[1] Rubab, Iram & Mamona, Yasmin & Khan, & Asgher, Tahira. (2020). Transformation of Legal Texts into Simplified Accounts to Make the Justice Accessible.

[2] Xu, Wei & Callison-Burch, Chris & Napoles, Courtney. (2015). Problems in Current Text Simplification Research: New Data Can Help. Transactions of the Association for Computational Linguistics. 3. 283-297. 10.1162/tacl_a_00139.

[3] Coster, William & Kauchak, David. (2011). Simple English Wikipedia: A New Text Simplification Task. 665-669.

[4] “Barrierefreie Information: Leicht Verständliche Sprache.” Capito, 4 Feb. 2021, www.capito.eu

[5] Shieber, Stuart & Nelken, Rani. (2006). Towards robust context-sensitive sentence alignment for monolingual corpora.

[6] Sultan, A.M. & Bethard, Steven & Sumner, Tamara. (2014). Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence. Transactions of the Association for Computational Linguistics. 2. 219-230. 10.1162/tacl_a_00178.

[7] Huang, Yonghui & Li, Yunhui & Luan, Yi. (2018). Monolingual sentence matching for text simplification.

Files and Subpages

Name	Type	Size	Last Modification
210517 Muralidharan Master Thesis Kickoff.pptx	File	1,14 MB	17.05.2021
210927 Muralidharan Master Thesis final presentation.pdf	File	1,30 MB	27.09.2021
211015 Muralidharan Master Thesis.pdf	File	2,73 MB	19.10.2021
Master Thesis kickoff presentation.pdf	File	874 KB	17.05.2021