
Master's Thesis Anum Afzal

Last modified Feb 8, 2021

Analyzing Employee Objectives Using Topic Modeling with Word Embeddings


 

Classical Machine Learning approaches to Natural Language Processing (NLP) work well in most cases, but their accuracy tends to plateau at a threshold they do not cross. State-of-the-art Deep Learning models, by contrast, have surpassed that threshold and come very close to human accuracy. This study performs a Topic Modeling task using Deep Learning models and compares the results with a classical approach known as Latent Dirichlet Allocation (LDA).

Topic Modeling is an unsupervised Machine Learning technique that groups documents into clusters and finds topic words for each cluster. At Merck Group, Topic Modeling is used to understand the objectives of employees without reading each document individually. The general idea is to group employees into clusters based on the similarity of their objectives and to find topics that depict the main goals of the employees in each cluster.

While an LDA model is able to provide good results, it has certain limitations. First, it relies purely on the frequency of words in a document, and all stop-words are removed as part of pre-processing. This leads to a loss of information about grammatical context and the order of words in a sentence. Second, documents commonly contain different words with the same meaning. An LDA model treats such synonymous words as unrelated.
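Both limitations can be seen in a minimal bag-of-words sketch. This is not the thesis pipeline: the stop-word list, the helper `bag_of_words`, and the example sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "by", "for", "and"}

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop-words, and count what remains."""
    tokens = [t.strip(".,") for t in text.lower().split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

# Limitation 1: word order is lost. The roles are swapped in these two
# sentences, yet their word counts are identical.
a = bag_of_words("The manager reports to the director.")
b = bag_of_words("The director reports to the manager.")
same_bag = (a == b)

# Limitation 2: synonyms are unrelated. These two objectives mean roughly
# the same thing but share no vocabulary at all.
c = bag_of_words("Increase revenue")
d = bag_of_words("Grow sales")
shared = set(c) & set(d)

print("identical bags:", same_bag)
print("shared words:", shared)
```

A frequency-based model only ever sees counts like these, so it cannot tell the first pair apart and cannot relate the second pair.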

A Word Embedding model offers a solution to the problems mentioned above, as it retains grammatical information by processing the sentence as a whole. Additionally, a Word Embedding model provides a multi-dimensional vector representation for each word, which allows it to capture the contextual similarity of words with the same meaning.
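The idea that vector representations capture similarity can be sketched with cosine similarity. The three-dimensional vectors below are made-up toy values, not the output of any trained model; real embeddings typically have hundreds of dimensions, and `toy_vectors` is a hypothetical name used only here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors (an assumption for illustration, not learned values).
toy_vectors = {
    "revenue":  [0.9, 0.1, 0.0],
    "sales":    [0.8, 0.2, 0.1],
    "training": [0.0, 0.2, 0.9],
}

sim_synonyms = cosine_similarity(toy_vectors["revenue"], toy_vectors["sales"])
sim_unrelated = cosine_similarity(toy_vectors["revenue"], toy_vectors["training"])

print("revenue vs. sales:   ", round(sim_synonyms, 3))
print("revenue vs. training:", round(sim_unrelated, 3))
```

With vectors like these, words of similar meaning point in nearly the same direction and score close to 1, while unrelated words score close to 0.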

This study demonstrates that using feature vectors from an Embedding model, and more specifically a Sentence Embedding model, can provide better results than an LDA model. It also discusses various topic word retrieval techniques and concludes that the frequency-based approach and TF-IDF provide the most coherent topic words. Lastly, it argues that analytical measures such as the Silhouette score and Coherence score are not suitable for Topic Modeling.
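The two retrieval techniques named above can be sketched as follows, assuming the documents have already been grouped into clusters. This is one plausible reading, not the thesis implementation: the TF-IDF variant here treats each cluster as a single document and uses a smoothed inverse document frequency, and the function names and example clusters are hypothetical.

```python
import math
from collections import Counter

def top_words_by_frequency(cluster_docs, k=3):
    """Return the k most frequent words across the documents of one cluster."""
    counts = Counter(w for doc in cluster_docs for w in doc.lower().split())
    return [w for w, _ in counts.most_common(k)]

def top_words_by_tfidf(clusters, k=3):
    """Rank the words of each cluster by TF-IDF, treating a cluster as one document.

    clusters: dict mapping a cluster id to a list of document strings.
    """
    term_counts = {
        cid: Counter(w for doc in docs for w in doc.lower().split())
        for cid, docs in clusters.items()
    }
    n_clusters = len(clusters)
    # Document frequency: in how many clusters does each word occur?
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    result = {}
    for cid, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            # Smoothed IDF (an assumption of this sketch).
            w: (c / total) * (math.log((1 + n_clusters) / (1 + df[w])) + 1)
            for w, c in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cid] = ranked[:k]
    return result

# Hypothetical toy clusters of employee objectives.
clusters = {
    "sales": ["increase sales in europe", "increase sales of new product"],
    "learning": ["complete leadership training", "complete new training course"],
}

freq_words = top_words_by_frequency(clusters["sales"], k=2)
tfidf_words = top_words_by_tfidf(clusters, k=2)
print(freq_words)
print(tfidf_words)
```

The frequency-based approach looks at one cluster in isolation, whereas TF-IDF down-weights words such as "new" that appear in several clusters and therefore say little about any one of them.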
