Master's Thesis Thien-An Huynh

Last modified Aug 11, 2023

Formalizing and automating regulatory document versioning

Abstract:

Legal consolidation is the act of applying modificatory provisions to a target document in the correct order and manner. In current practice, this process is time-consuming and error-prone, in large part because it is performed exclusively by hand.

Numerous research efforts have tried to automate this process by using natural language processing techniques and machine learning models to parse modificatory provisions, which are often provided only as unstructured natural language, into a formalized, automatically executable format. However, researchers have not yet found a reliable method of semantic information extraction or annotation for this use case, and the problem of automatically parsing these instructions remains largely unsolved. Most of these projects also focus purely on the information extraction part of the consolidation process: they assess only the correctness of the extracted information and forgo the automatic application of the instructions to the targeted document. Finally, there is currently no research on English-language regulatory documents in the context of automatic legal consolidation. In fact, the vast majority of research papers in this field are authored by Italian researchers.

The goal of this thesis is to research and develop requirements for formalizing modificatory provisions. The resulting format is based on existing research but focuses specifically on amending documents released by the UNECE and by the US federal government. The format is designed to be human-readable and human-writable, because developing a system that automatically parses natural language text into this format would expand the scope of this thesis to an unreasonable level. In addition, a consolidation engine reference implementation is developed that can automatically apply a subset of these modificatory provisions to a target document.
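
To make the idea of a machine-executable provision concrete, the following is a minimal sketch of how formalized replacements, insertions, and deletions might be applied to a document. It is not the format or engine developed in the thesis: the `Provision` fields, the action names, and the `consolidate` function are all hypothetical, and the base document is assumed to be an ordered list of (paragraph id, text) pairs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Paragraph = Tuple[str, str]  # (paragraph id, paragraph text); assumed document model


@dataclass(frozen=True)
class Provision:
    """One formalized modificatory provision (hypothetical schema)."""
    action: str                   # "replace", "insert_after", or "delete"
    target: str                   # id of the paragraph the provision refers to
    text: Optional[str] = None    # new text for "replace" and "insert_after"
    new_id: Optional[str] = None  # id of the paragraph created by "insert_after"


def consolidate(base: List[Paragraph], provisions: List[Provision]) -> List[Paragraph]:
    """Apply provisions in list order and return a new consolidated document."""
    doc = list(base)  # work on a copy so the base document stays unchanged
    for p in provisions:
        # Locate the target paragraph in the current state of the document.
        idx = next((i for i, (pid, _) in enumerate(doc) if pid == p.target), None)
        if idx is None:
            raise KeyError(f"target paragraph {p.target!r} not found")
        if p.action == "replace":
            doc[idx] = (p.target, p.text or "")
        elif p.action == "insert_after":
            if p.new_id is None:
                raise ValueError("insert_after requires new_id")
            doc.insert(idx + 1, (p.new_id, p.text or ""))
        elif p.action == "delete":
            del doc[idx]
        else:
            raise ValueError(f"unknown action {p.action!r}")
    return doc


base = [("1.1", "Scope."), ("1.2", "Definitions."), ("1.3", "Old requirement.")]
provisions = [
    Provision("replace", "1.3", text="New requirement."),
    Provision("insert_after", "1.1", text="Purpose.", new_id="1.1.1"),
    Provision("delete", "1.2"),
]
result = consolidate(base, provisions)
```

Because each provision is resolved against the current state of the document, the order of application matters in this sketch: a provision that targets a paragraph deleted by an earlier provision raises an error instead of being silently skipped.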

Both the formalization format and the consolidation engine were evaluated by manually converting unstructured natural language modificatory provisions into the machine-executable format and feeding them to the consolidation engine reference implementation. The output was compared against documents consolidated by regulatory experts at Certivity, which served as the ground truth. The machine-executable format was evaluated in terms of its expressiveness, that is, whether it can accurately model all modificatory provisions from the UN and US data sets. The consolidation engine reference implementation was evaluated on the correctness of the resulting consolidated document, provided that the modificatory provision had been accurately converted. Lastly, interesting edge cases and modificatory provisions that could cause issues for a fully automated consolidation system are documented and categorized.

Research Questions:

RQ1: What is the minimum set of consolidation engine operations needed to model all modificatory provisions in the data set?

RQ2: What is the minimum set of metadata fields needed in the representation formats of base documents and modificatory provisions in order to perform automatic consolidation?

RQ3: How accurate are the automatic change applications performed by the consolidation engine reference implementation?

Conclusion:

The evaluation of both the formalization format and the consolidation engine demonstrated the potential of our approach for automating basic replacements, insertions, and deletions in regulatory documents.

However, the research also identified several challenges that currently prevent full automation of the legal consolidation process. These challenges are often caused by implicit information, unusual edge cases, and mistakes made by regulatory bodies. Despite these limitations, this thesis contributes to the ongoing efforts to automate legal consolidation by documenting difficult edge cases and highlighting promising areas for future research. By addressing the identified challenges and further refining the formalization format and consolidation engine, it is possible to significantly reduce the time and effort spent on legal consolidation tasks while minimizing errors in the process.

In light of these findings, future work should focus on improving semantic information extraction and annotation, and on incorporating this thesis' findings on edge cases and potential issues into the development of a more reliable and accurate solution. Furthermore, extending the scope of research beyond the Italian context and exploring the applicability of this approach to other languages and legal systems would contribute to the global advancement of fully automated legal consolidation.

Files and Subpages

Name	Type	Size	Last Modification
Thien-An Huynh Kickoff.pptx	File	7,03 MB	11.08.2023
Thien-An Huynh MA final-presentation.pptx	File	8,14 MB	11.08.2023
Thien-An Huynh MastersThesis.pdf	File	7,51 MB	11.08.2023