Guided Research Maria Nakhla

Last modified Apr 18

Abstract

 

The extraction of domain-specific keywords from textual data, a critical application within Natural Language Processing (NLP), has gained substantial importance in the contemporary data-driven landscape. The research concern is that extracted keywords are likely to deviate from the core meaning of the domain. This happens because n-th child keyword relations can be introduced that do not relate directly to the main domain goal. Further keyword filtering is therefore a crucial step to ensure that all keywords actually belong to the target domain. The methodology consists of two main steps. The first is clustering: multiple clustering techniques are investigated, in particular a convex hull approach. The second step removes outliers; several techniques were tested, such as Isolation Forest and Local Outlier Factor. As a final step, text-embedding similarity measures that make use of WordNet and ConceptNet are applied. The techniques are evaluated using precision, recall, and F1-score, and additionally with the help of domain experts. The results obtained with the convex hull clustering approach are promising. The hybrid method, which combines clustering, outlier detection, and semantic similarity, proved able to remove irrelevant class-specific keywords.
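To make the filtering idea concrete, the following is a minimal sketch of removing keywords whose embeddings lie far from the rest of a class. It is not the method used in the report: it swaps in a simple cosine-similarity-to-centroid rule as a stand-in for Isolation Forest or Local Outlier Factor. The function names, the two-dimensional toy embeddings, and the threshold of 0.8 are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def filter_keywords(embeddings, threshold=0.8):
    """Split keywords into (kept, dropped) by similarity to the class centroid.

    `embeddings` maps keyword -> vector. The threshold is a hypothetical
    value, not one taken from the report.
    """
    center = centroid(list(embeddings.values()))
    kept, dropped = [], []
    for word, vector in embeddings.items():
        if cosine(vector, center) >= threshold:
            kept.append(word)
        else:
            dropped.append(word)
    return kept, dropped


# Toy, hand-made embeddings for a hypothetical "neuroscience" class.
toy_embeddings = {
    "neuron": [0.9, 0.1],
    "synapse": [0.8, 0.2],
    "cortex": [0.85, 0.15],
    "invoice": [0.1, 0.9],
}

kept, dropped = filter_keywords(toy_embeddings)
print("kept:", kept)
print("dropped:", dropped)
```

In practice the vectors would come from a text-embedding model, and the centroid rule would be replaced by a proper outlier detector; the sketch only shows the shape of the filtering step.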

Research Questions

 

  1. Which clustering approaches currently exist that can be utilized to cluster keywords based on relevance to a class?
  2. What are possible outlier detection methods that could also help to achieve a more class-specific keyword set?
  3. Could different methods be combined for better results?
  4. In which ways can the resulting filtered keyword set be evaluated?
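As an illustration for the last question, precision, recall, and F1-score can be computed by comparing a filtered keyword set against a reference set, for example one curated by domain experts. The sketch below assumes both are plain sets of strings; the function name `evaluate` and the example keywords are hypothetical and not taken from the report.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for a predicted keyword set vs. a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: two of three predicted keywords are in the gold set.
predicted_keywords = {"neuron", "synapse", "invoice"}
gold_keywords = {"neuron", "synapse", "cortex", "axon"}

precision, recall, f1 = evaluate(predicted_keywords, gold_keywords)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision penalizes irrelevant keywords that survive filtering, while recall penalizes relevant keywords that were removed, so reporting both shows whether a filter is too lenient or too aggressive.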

Files and Subpages

Name Type Size Last Modification Last Editor
240129 Nakhla Kickoff GR.pptx 1,83 MB 18.04.2024
240415 Nakhla GR Final.pptx 2,06 MB 18.04.2024
240415 Nakhla GR Report.pdf 377 KB 18.04.2024