
Master's Thesis Andrei Kreinhaus

Last modified Feb 15, 2023

An Expert-Defined Rule-Based Approach for Generating Vector Representations to Classify Texts

Abstract

Classifying text documents by car component requires labeling each document individually to create a vocabulary-specific training dataset for further classification. This labeling must be done manually and is therefore extremely time-consuming and inflexible. Moreover, the approach cannot adapt to changing requirements such as label granularity, because the labels remain static and cannot be quickly customized.

To solve the classification problem of customer complaints in the automotive industry, I propose a combined approach based on vector representations of predefined context rules to map a customer problem to predefined classes of vehicle components.

This process involves creating a pipeline with several steps, such as data preprocessing, keyword extraction, topic modeling, labeling, context rule creation, context window extraction, vectorization of context windows, and assignment of initially unclassified documents to predefined classes. Various vectorization techniques are tested, ranging from traditional methods such as tf-idf to state-of-the-art models such as BERT.
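The later pipeline steps (context window extraction, vectorization, and class assignment) can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the rule set `CONTEXT_RULES`, the window size, the smoothed tf-idf weighting, the summed class profiles, and all function names are assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import re
from collections import Counter

# Hypothetical expert-defined context rules: class -> trigger keywords.
CONTEXT_RULES = {
    "brakes": {"brake", "brakes"},
    "engine": {"engine", "motor"},
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def context_windows(tokens, keywords, size=3):
    """Return the +/- `size` token window around every keyword hit."""
    return [tokens[max(0, i - size): i + size + 1]
            for i, tok in enumerate(tokens) if tok in keywords]

def tfidf_vectors(docs):
    """Compute tf-idf vectors (smoothed idf) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()}
               for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_class_profiles(corpus):
    """Label context windows via the rules and sum their vectors per class."""
    windows, labels = [], []
    for text in corpus:
        tokens = tokenize(text)
        for label, keywords in CONTEXT_RULES.items():
            for window in context_windows(tokens, keywords):
                windows.append(window)
                labels.append(label)
    vectors, idf = tfidf_vectors(windows)
    profiles = {}
    for vec, label in zip(vectors, labels):
        profile = profiles.setdefault(label, {})
        for term, weight in vec.items():
            profile[term] = profile.get(term, 0.0) + weight
    return profiles, idf

def classify(text, profiles, idf):
    """Assign a document to the most similar class, or None if no overlap."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    scores = {label: cosine(vec, prof) for label, prof in profiles.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None

# Made-up example complaints, used only to exercise the sketch.
corpus = [
    "The brake pedal feels soft and the brakes squeal when stopping",
    "Engine stalls at idle and the motor makes a rattling noise",
]
profiles, idf = build_class_profiles(corpus)
print(classify("pedal feels soft when stopping", profiles, idf))
```

The point of the sketch is that a document containing no rule keyword can still be assigned to a class, because it is compared against the vocabulary surrounding the keywords rather than against the keywords themselves. Replacing `tfidf_vectors` with embeddings from a model such as BERT would change only the vectorization step.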

 

Research Questions:

  1. What challenges arise when creating structured datasets from unstructured documents?
  2. Which NLP methods can be combined with domain expertise to facilitate the extrapolation from context rules to training data?
  3. How do these novel methods compare to current unsupervised learning methods in the context of automotive customer data?

 
