Transfer Learning for Human Resource Management

Last modified Feb 8, 2022

This project uses Transfer Learning to solve two problems faced by the Human Resources department at Merck.

1) Topic Modeling for Employee Objectives using Word Embeddings

Project Summary

Classical Machine Learning approaches to Natural Language Processing (NLP) work well in most cases, but their accuracy tends to plateau below a certain threshold. State-of-the-art Deep Learning models, on the other hand, have managed to surpass that threshold and come very close to human accuracy. This study performs a Topic Modeling task using Deep Learning models and compares the results with a classical approach, Latent Dirichlet Allocation (LDA). It demonstrates how feature vectors from an embedding model, and more specifically a Sentence Embedding model, can provide better results than an LDA model. It also discusses various topic-word retrieval techniques and concludes that the frequency-based approach and TF-IDF provide the most coherent topic words. Lastly, it argues that analytical measures such as the Silhouette score and the Coherence score are not suitable for evaluating Topic Modeling.

Motivation:

Topic Modeling is an unsupervised Machine Learning technique that groups documents into clusters and finds topic words for each cluster. At Merck Group, Topic Modeling is used to understand the objectives of employees without reading the documents. The general idea is to group employees into clusters based on the similarity of their objectives and find topics which depict the main goals of the cluster's employees.

While an LDA model can provide good results, it has certain limitations. First, it relies purely on the frequency of words in a document, and all stop-words are removed during pre-processing. This discards grammatical context and the order of words in a sentence. Second, documents often contain different words with the same meaning, and an LDA model treats such synonyms as unrelated words.
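The limitations described above can be illustrated with a minimal Python sketch. It is not the pre-processing used in the study: the stop-word list, the `bag_of_words` helper, and the example sentences are all assumptions chosen for illustration. The sketch shows that a frequency-only representation cannot distinguish sentences that differ in word order or in a removed stop-word, and that synonyms share no features.

```python
from collections import Counter

# Toy stop-word list (an assumption for this sketch, not the study's list).
STOP_WORDS = {"the", "a", "an", "to", "of", "do", "not"}


def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop-words, and count the rest."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


# 1) Word order is lost: both sentences map to the same counts.
order_lost = bag_of_words("Manager reviews the team") == bag_of_words(
    "The team reviews manager"
)

# 2) Stop-word removal can drop meaning: the negation disappears.
negation_lost = bag_of_words("do not reduce costs") == bag_of_words("reduce costs")

# 3) Synonyms are unrelated features: the two objectives share no words.
shared_words = set(bag_of_words("increase sales")) & set(bag_of_words("grow revenue"))

print(order_lost, negation_lost, shared_words)
```

In a frequency-based model, the last pair of objectives would look entirely unrelated even though a human reader would consider them near-identical.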

A word embedding model addresses these problems because it processes the sentence as a whole and can therefore retain grammatical information. Additionally, a word embedding model provides a multi-dimensional vector representation for each word, which captures the contextual similarity of words with the same meaning.

This study demonstrates how feature vectors from an embedding model, and more specifically a Sentence Embedding model, can provide better results than an LDA model. It also discusses various topic-word retrieval techniques and concludes that the frequency-based approach and TF-IDF provide the most coherent topic words. Lastly, it argues that analytical measures such as the Silhouette score and the Coherence score are not suitable for evaluating Topic Modeling.
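As a rough illustration of TF-IDF-based topic-word retrieval, the following Python sketch ranks the words of each cluster by treating every cluster as a single document. The study's exact TF-IDF variant is not specified here, so the smoothed formula, the `topic_words` function, and the toy clusters are assumptions. Cluster assignments are taken as given; in practice they would come from clustering the embedding vectors.

```python
import math
from collections import Counter


def topic_words(clusters, top_n=3):
    """Rank words per cluster by a TF-IDF score computed across clusters.

    clusters: dict mapping a cluster id to a list of tokenized documents.
    Each cluster is treated as one document (an assumption of this sketch).
    """
    # Term counts per cluster, pooled over all documents in the cluster.
    counts = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    n_clusters = len(counts)

    # Number of clusters in which each word occurs.
    cluster_freq = Counter()
    for cluster_counts in counts.values():
        cluster_freq.update(cluster_counts.keys())

    result = {}
    for cid, cluster_counts in counts.items():
        total = sum(cluster_counts.values())
        scores = {
            # Term frequency times a smoothed inverse cluster frequency.
            word: (c / total) * math.log(1 + n_clusters / cluster_freq[word])
            for word, c in cluster_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cid] = ranked[:top_n]
    return result


# Hypothetical clusters of tokenized employee objectives.
toy_clusters = {
    "sales": [["increase", "sales", "team"], ["grow", "sales", "revenue"]],
    "safety": [["improve", "lab", "safety", "team"], ["complete", "safety", "training"]],
}
topics = topic_words(toy_clusters)
print(topics)
```

Words that occur in every cluster, such as "team" in the toy data, are down-weighted, whereas a purely frequency-based ranking would score them by count alone.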

Problem Statement:

Employees write their objectives twice a year, and these are read by the HR department as well as by people in managerial positions. Given the large number of employees, this is a very time-consuming task, and the purpose of this project is to use NLP techniques to automate it. The general idea is to cluster similar employees based on their objectives and derive the main topics for each cluster.

Research Questions:

RQ1: Could using embedding vectors lead to better results than a Latent Dirichlet Allocation (LDA) model?

RQ2: If the word embedding models are able to provide better results, then which type of embedding model is better suited?

RQ3: Could using a traditional algorithm such as LDA in tandem with the Embedding models provide better results?

2) Evaluating Text Similarity Techniques for Matching Personal Employee Objectives

Project Summary

Matching employees based on their personal objectives is difficult in multinational companies that have many offices worldwide and thousands of employees, because manually comparing the personal objectives of thousands of employees and extracting similar candidates takes considerable time and effort. The problem becomes harder when employees change their personal objectives on a yearly basis. In this paper, we present a system that automates the process of matching employees based on their personal objectives. The system extracts similar employees based on the semantic similarity of their personal objectives encoded in word embeddings. Usability evaluations show that the system improves the matching process by reducing the time required and by finding results that are difficult to obtain through manual matching.

Motivation:

TBD.

Problem Statement:

Searching for similarities between different pieces of text has always been an interesting topic in machine learning. The problem is composed of multiple subproblems: finding the best representation of a word, aggregating word representations into sentence representations, and finally defining the notion of similarity between these representations. While there are multiple ways to solve each subproblem, the best combination of individual solutions depends heavily on the problem at hand. Our concrete problem for this guided research is finding employees with similar personal objectives in a database of employee objectives obtained from the HR department of Merck.
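The three subproblems can be sketched in Python as follows. This is one possible combination under stated assumptions, not the system evaluated in this project: the 3-dimensional `WORD_VECTORS` are hand-made toy values rather than output of a trained embedding model, sentences are aggregated by mean pooling, similarity is measured by cosine similarity, and the employee ids and objectives are hypothetical.

```python
import math

# Subproblem 1: word representation (hypothetical toy vectors).
WORD_VECTORS = {
    "increase": [0.9, 0.1, 0.0],
    "grow": [0.8, 0.2, 0.0],
    "sales": [0.7, 0.0, 0.3],
    "revenue": [0.6, 0.1, 0.3],
    "improve": [0.1, 0.9, 0.0],
    "safety": [0.0, 0.8, 0.2],
}


def sentence_vector(text):
    """Subproblem 2: aggregate word vectors into one vector by mean pooling."""
    vectors = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in: " + text)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Subproblem 3: similarity as the cosine of the angle between vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query_id, objectives):
    """Return the id of the employee whose objective is closest to the query."""
    query = sentence_vector(objectives[query_id])
    scores = {
        emp_id: cosine_similarity(query, sentence_vector(text))
        for emp_id, text in objectives.items()
        if emp_id != query_id
    }
    return max(scores, key=scores.get)


# Hypothetical employee objectives.
objectives = {
    "emp_a": "increase sales",
    "emp_b": "grow revenue",
    "emp_c": "improve safety",
}
best_match = most_similar("emp_a", objectives)
print(best_match)
```

Each function can be swapped independently, for example replacing mean pooling with a sentence embedding model, which is why the final combination depends on the problem at hand.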

Research Questions:

TBD.