Bachelor's Thesis Georg Bonczek

Last modified Jan 4, 2018

Conceptualization and Implementation of a Rule-based Workbench for Textual Pattern Annotation

 

Abstract

With the clear advantages and capabilities of deep learning and machine learning in general, research on information extraction is nowadays seldom based on explicit, manually crafted rules. Like every technology, both approaches have their merits and drawbacks. One advantage of rule-based systems is that domain knowledge can be translated directly into an information extraction system.

Rule engineers (software engineers) rarely have the same comprehensive domain knowledge as domain experts. Conversely, current tools and rule languages expect the domain expert to have substantial knowledge of software engineering and the underlying framework. They require deep awareness of all involved parts of the system and do not abstract away the technicalities, as they focus on providing feature richness to system engineers rather than on a rule development process that end users can carry out themselves. The main question this thesis addresses is how existing tools, workbenches and integrated development environments (IDEs) can be improved to support a complete workflow, from the planning of a rule-based information extraction system through its implementation and evaluation.

We first describe the current approaches to workbenches and IDEs, comparing them with regard to the richness of features pertinent to a domain-specific workflow in rule development. Additionally, we describe use cases of information extraction tasks in the legal domain. Restricting the scope to a specific domain allows us to make assumptions about how the tools will be used. Based on the resulting understanding of the requirements and fundamentals, multiple tools are proposed that address these points. These components are integrated into the Lexia system to leverage its existing capabilities and infrastructure.

Multiple artifacts are created and embedded in the Lexia system. First, the steps of the MATTER workflow are fully modeled through multiple web-based views and APIs. The second main artifact is a rule language that provides an easy and accessible syntax tailored to the needs encountered when developing systems for the extraction of high-level linguistic entities from legal documents. It enforces a strong separation of concerns between software engineers and domain experts through a language design that relies on extensions written in the Java programming language.

The rule language and the implemented workbench are an important step toward domain-specific workbenches for rule-based information extraction. We find that the artifacts provide less functionality than more generic IDEs, but offer many advantages because they are embedded in a domain-specific environment such as Lexia. Hence, workflows are more integrated and uninterrupted, creating a workbench for domain experts that focuses on knowledge engineering rather than system engineering.

Files and Subpages

Name Type Size Last Modification Last Editor
Final_Bonczek.pdf 9,95 MB 15.02.2018
Kickoff_Bonczek.pdf 8,66 MB 15.02.2018