Master's Thesis Georg Bonczek

Last modified Jan 16, 2020

Exploring Challenges and Opportunities for Increasing Reuse in NLP Application Development

Decades of research and practical experience have made enterprise information systems a well-understood field. Moving from mostly physical (e.g. paper-based) information exchange to digital information systems allowed large corporations to optimize their information flows and scale their processes to sizes that were previously hardly achievable.

Natural Language Processing (NLP) offers a similar opportunity for increased digitalization and automation by gaining insights into documents (e.g. text or speech) generated by humans.

As many information systems handle large volumes of documents that are processed and manipulated by humans, automated extraction of some relevant information is a valuable addition to such systems.

However, as the IT landscape of an enterprise consists of many different information systems with different architectures and interfaces, developing individual enhancements for each information system becomes inefficient.

In this master's thesis, we first design and implement a generic architecture for an NLP platform that makes it possible to process a large variety of documents in a structured way, with a focus on the reusability of the modules responsible for analyzing these documents. This platform should be able to integrate into a wide variety of information systems without heavy adaptation effort.

In a next step, we describe mid-scale information systems with a special emphasis on the collaborative generation of data. Users can annotate and enrich data using a graphical interface, building upon the information produced by others. On the one hand, users should be supported by machine-generated annotations; on the other hand, the manually generated information can be used to train models, to evaluate them, or for active machine learning.

After a literature survey on NLP-focused reference architectures and generic ways to model and structure documents, we describe an architecture with the capabilities of a traditional information system (focusing on the aspects relevant to NLP; no generic enterprise IS architecture will be attempted) that additionally supports the requirements of an interactive, collaborative, NLP-based workflow natively. This architecture is then used to (partially) implement an information system for the management of world-wide legislation on vehicle type approval.

The legislation is imported from a range of different sources, analyzed by a range of NLP services and made available to users that collaboratively derive company-internal guidelines for the design of vehicles.

The main challenges that we will focus on are:

A reusable and capable platform for NLP services

Being able to integrate NLP into a wide variety of text-intensive information systems is a compelling reason to build a platform that allows heavy reuse of components geared towards natural language processing. However, processing a wide variety of differently structured documents in a scalable way requires a well-defined environment both for adding new components and for using them.
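To make the idea of reusable components more concrete, the following is a minimal sketch of what such a "well-defined environment" could look like. It is not the architecture developed in the thesis: the names `Document`, `Annotation`, `Component`, `WhitespaceTokenizer`, `KeywordTagger` and `run_pipeline` are hypothetical and chosen purely for illustration. The assumption is that every component reads and writes stand-off annotations on a shared document object, so that components can be combined freely.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Protocol


@dataclass
class Annotation:
    """A stand-off annotation over a character span (hypothetical model)."""
    start: int
    end: int
    label: str
    source: str  # name of the component (or user) that produced it


@dataclass
class Document:
    """A text together with all annotations added so far."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)


class Component(Protocol):
    """Assumed common interface: every component maps a document to a document."""
    name: str

    def process(self, doc: Document) -> Document: ...


class WhitespaceTokenizer:
    """Toy component: marks whitespace-separated tokens."""
    name = "tokenizer"

    def process(self, doc: Document) -> Document:
        pos = 0
        for tok in doc.text.split():
            start = doc.text.index(tok, pos)
            end = start + len(tok)
            doc.annotations.append(Annotation(start, end, "token", self.name))
            pos = end
        return doc


class KeywordTagger:
    """Toy component: reuses token annotations produced by an earlier step."""
    name = "keyword_tagger"

    def __init__(self, keywords: Iterable[str]) -> None:
        self.keywords = {k.lower() for k in keywords}

    def process(self, doc: Document) -> Document:
        tokens = [a for a in doc.annotations if a.label == "token"]
        for t in tokens:
            # Compare without surrounding punctuation; the span stays the token span.
            word = doc.text[t.start:t.end].strip(".,;:").lower()
            if word in self.keywords:
                doc.annotations.append(
                    Annotation(t.start, t.end, "keyword", self.name)
                )
        return doc


def run_pipeline(doc: Document, components: Iterable[Component]) -> Document:
    """Apply the components in order; each one sees the earlier results."""
    for component in components:
        doc = component.process(doc)
    return doc


doc = run_pipeline(
    Document("Vehicle type approval requires braking tests."),
    [WhitespaceTokenizer(), KeywordTagger(["braking", "approval"])],
)
keywords = [doc.text[a.start:a.end] for a in doc.annotations if a.label == "keyword"]
```

Because each component depends only on the shared document structure, the tokenizer in this sketch could be reused unchanged by any other component that needs tokens. Recording the `source` of each annotation is one possible way to keep machine-generated and user-generated annotations apart.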

Modeling of legal documents

The architecture created for the information system should be applicable to other systems as well. Thus, a flexible data schema is integral to a successful design. Depending on the outcome of a literature review, either an existing schema, model or meta-model is chosen, or a self-designed approach is implemented.

Information retrieval in world-wide homologation legislation

To be able to draw meaningful conclusions about the expressiveness of the designed system, we have to extract a wide variety of information relevant to the homologation process. We extract different types of entities, which are selected with the help of domain experts. These entities are selected in such a way that they test different parts of the system. Some information might be mostly user-facing and only relevant for faster navigation, while other information is processed further and used to match documents to technical specifications stored in databases. The complete list of information and requirements for the system will be gathered together with a wide variety of domain experts and stakeholders in leading roles in the design, development and maintenance of this information system.

This structure is reflected in the main research questions:

- Which steps in NLP pipelines are common, and how can they be made reusable?

- How can NLP be effectively used in collaborative, interactive environments?

- How does the integration of NLP affect document modeling and architecture in mid-scale information systems?

- Which data needs to be extracted, and how can it be used for an efficient preparation of homologation legislation?
