
Master Thesis Akshaya Muralidharan

Last modified by Stefanie Urchs (account disabled) Oct 22, 2021

Automatic English Text Simplification for Statutes

 

Automatic text simplification makes text more accessible and easier to interpret for the general public, and it is useful in almost every domain. It is especially important in the legal domain, where government regulations are often written in hard-to-understand language; simpler language has been shown to make justice more accessible [1]. Text simplification is a low-resource task: very few aligned datasets are available for training machine learning models. The existing datasets have been collected either manually [2] or by applying text alignment techniques to Wikipedia articles [3]. Several providers, such as Capito [4], offer paid services that convert legal text into a simplified form, but there are hardly any publicly available AI-assisted services that do this in an automated or semi-automated manner. While corpora of aligned complex and simplified legal texts do not exist, many legal articles available online are written in simplified language and could be leveraged to build an aligned corpus using text alignment techniques [5][6][7].

 

The aim of the thesis is to create an aligned dataset for the simplification of legal texts. The thesis involves the following tasks:

  • Find a source of legal texts that is simple to read
  • Evaluate the readability of the text
  • Scrape the text and store it in a structured format
  • Find the original laws and acts that correspond to the easy text
  • Scrape and store the original laws in a structured format
  • Survey text alignment techniques
  • Align the simple legal text and original laws on a document/article level
  • Align the simple legal text and original laws on a paragraph/sentence level
  • Survey and find transformer-based pre-trained models to fine-tune
  • Measure and compare complexity of collected texts
  • Evaluate the performance of existing text simplification models on the collected legal text
  • Thesis write-up
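
The sentence-level alignment task above could be prototyped along the following lines. This is only a minimal sketch, assuming a bag-of-words cosine similarity and a greedy best-match rule; it is not the method of [5], [6] or [7]. The function names, the 0.3 threshold and the example sentences are made up for illustration and do not come from any real statute.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(simple_sentences, complex_sentences, threshold=0.3):
    """Greedily pair each simple sentence with its most similar complex one.

    Returns (simple_index, complex_index, score) tuples. Simple sentences
    whose best score is below the (hypothetical) threshold stay unaligned.
    """
    complex_vecs = [Counter(tokenize(s)) for s in complex_sentences]
    pairs = []
    for i, sentence in enumerate(simple_sentences):
        vec = Counter(tokenize(sentence))
        scores = [cosine(vec, c) for c in complex_vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((i, best, scores[best]))
    return pairs


# Made-up example sentences, for illustration only.
complex_text = [
    "The tenant shall pay the rent to the landlord on the first day of each month.",
    "The landlord may terminate the agreement upon thirty days written notice.",
]
simple_text = [
    "You must pay the rent on the first day of each month.",
    "The landlord can end the agreement with thirty days notice.",
    "Ask a lawyer if you are unsure.",
]

for i, j, score in align(simple_text, complex_text):
    print(f"simple[{i}] <-> complex[{j}] (similarity {score:.2f})")
```

Raw token counts let frequent function words such as "the" inflate the similarity, so a real pipeline would probably weight terms (for example with TF-IDF) or use sentence embeddings, and would also need to handle one-to-many alignments, which this greedy rule ignores.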

 

References:

 

[1] Rubab, Iram & Mamona, Yasmin & Khan, & Asgher, Tahira. (2020). Transformation of Legal Texts into Simplified Accounts to Make the Justice Accessible. 

[2] Xu, Wei & Callison-Burch, Chris & Napoles, Courtney. (2015). Problems in Current Text Simplification Research: New Data Can Help. Transactions of the Association for Computational Linguistics. 3. 283-297. 10.1162/tacl_a_00139. 

[3] Coster, William & Kauchak, David. (2011). Simple English Wikipedia: A New Text Simplification Task. 665-669. 

[4] “Barrierefreie Information: Leicht Verständliche Sprache.” Capito, 4 Feb. 2021, www.capito.eu

[5] Shieber, Stuart & Nelken, Rani. (2006). Towards robust context-sensitive sentence alignment for monolingual corpora. 

[6] Sultan, A.M. & Bethard, Steven & Sumner, Tamara. (2014). Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence. Trans. Assoc. Comput. Linguist. 2. 219-230. 10.1162/tacl_a_00178. 

[7] Huang, Yonghui & Li, Yunhui & Luan, Yi. (2018). Monolingual sentence matching for text simplification. 
