
Guided Research Shabnam Sadegharmaki

Last modified Apr 12, 2021

Exploring the Effect of Data Augmentation on the Quality of Text Classification


 

Text classification provides solutions to a wide range of real-world applications, from legal advising to financial risk management. By applying machine-learning techniques, models are trained to distinguish the legal norms in a contract or legislation context [1]. Knowing the context of a contract, the trained model can advise on the quality and coverage of a legal text.

Another scenario we are interested in is monitoring the financial state of a company by analyzing news about the company or its industry. Text classification eases the task of financial experts by detecting critical news that affects the future of a company. Otherwise, they have to skim many lengthy articles to monitor all events. An early warning system of this kind helps insurance companies such as Allianz to avoid losses and to investigate effectively.

To accomplish these tasks, methods such as classification are commonly used [2]. However, classification techniques, as a form of supervised learning, require a large quantity of training data labeled by domain experts. Text data makes this problem worse, as the experts must read a large volume of text to grasp its intended meaning. Labeling text data is therefore labor-intensive, and labeled text data is consequently scarce.

The aim of this guided research is to improve the quality of text classification in the absence of adequate training data. During our preliminary research, we found two groups of methods that lead toward this goal:

First, text data augmentation takes the existing labeled dataset and, by transforming its examples, produces a distinct set of data with the same labels. In other words, it increases the amount of training data by adding variation to the existing dataset. The idea originates from the image processing research literature. Sun and He adapted it to text data and showed that increasing the amount of training data through text augmentation can improve classification performance [3].
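To make the idea of label-preserving transformations concrete, the following is a minimal sketch of text augmentation. It is not the method of [3]; it assumes two generic token-level operations (random swap and random deletion), and the function names, the deletion probability, and the number of variants per example are hypothetical choices made only for illustration.

```python
import random


def random_swap(tokens, rng):
    """Return a copy of tokens with two randomly chosen positions swapped."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(dataset, n_variants=2, seed=0):
    """Return the original (text, label) pairs plus n_variants transformed
    copies of each pair. Every copy keeps the label of its source example."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        tokens = text.split()
        for _ in range(n_variants):
            # Assumption: one operation is picked uniformly at random per copy.
            operation = rng.choice([random_swap, random_deletion])
            augmented.append((" ".join(operation(tokens, rng)), label))
    return augmented
```

With two labeled examples and `n_variants=2`, `augment` returns six pairs: the two originals followed by two variants of each. Whether such simple transformations preserve the semantic type of a legal norm or the criticality of a news article is exactly the kind of question the research would have to answer empirically.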

The second method we will investigate is graph-based semi-supervised learning (SSL). Semi-supervised approaches use both labeled and unlabeled data to tackle the classification problem [4]. Hence, they are better suited to settings where labeled data is lacking. Among the semi-supervised approaches, graph-based methods fit NLP solutions more naturally [5]. Moreover, Ravi and Diao [6] presented a scalable approach to graph-based SSL, which we will follow during this research.
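As an illustration of how a graph can spread labels from labeled to unlabeled examples, the sketch below implements plain iterative label propagation on a small weighted graph. It is not the streaming approximation of [6]; the function name `propagate_labels`, the graph representation, and the fixed iteration count are assumptions made for this example. In practice, the edge weights could come from a text similarity measure.

```python
def propagate_labels(edges, seeds, n_labels, n_iter=20):
    """Basic label propagation.

    edges: dict mapping each node to a dict {neighbor: weight}
           (every neighbor must itself be a key of edges).
    seeds: dict mapping labeled nodes to their label index.
    Returns a dict mapping every node to its most likely label index.
    """
    nodes = list(edges)
    # Unlabeled nodes start with a uniform label distribution,
    # labeled (seed) nodes with a one-hot distribution.
    dist = {v: [1.0 / n_labels] * n_labels for v in nodes}
    for v, y in seeds.items():
        dist[v] = [1.0 if k == y else 0.0 for k in range(n_labels)]

    for _ in range(n_iter):
        new_dist = {}
        for v in nodes:
            total = sum(edges[v].values())
            if v in seeds or total == 0:
                # Seed labels stay clamped; isolated nodes keep their state.
                new_dist[v] = dist[v]
                continue
            # Weighted average of the neighbors' label distributions.
            new_dist[v] = [
                sum(w * dist[u][k] for u, w in edges[v].items()) / total
                for k in range(n_labels)
            ]
        dist = new_dist

    return {v: max(range(n_labels), key=lambda k: dist[v][k]) for v in nodes}
```

For a chain of four documents in which only the two end nodes are labeled, each unlabeled node ends up with the label of the seed it is most strongly connected to.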

During this research, we will implement both methods and compare the results of text classification on two data sources. The first dataset comprises 600 sentences from legal norms, classified by semantic type according to a functional taxonomy. The second dataset consists of 1216 financial news articles in German, which Allianz financial experts have labeled according to whether or not the article contains critical information.

This research will assess the improvement in classification models when the two techniques are applied, compared to classical supervised text classification. Then, the performance of the applied methods will be compared across the two datasets introduced above. Comparing them in different contexts allows us to investigate how well they generalize as a cross-domain solution.
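One possible way to carry out such a comparison is to score every method with the same metric on the same held-out labels. The sketch below assumes macro-averaged F1 as that metric; the proposal itself does not fix a metric, and the names `macro_f1` and `compare_methods` are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare_methods(y_true, predictions_by_method):
    """Score each method's predictions against the same gold labels."""
    return {
        method: macro_f1(y_true, y_pred)
        for method, y_pred in predictions_by_method.items()
    }
```

Calling `compare_methods` once per dataset, with one entry each for the supervised baseline, the augmented model, and the graph-based model, would give a table of scores that can be read both across methods and across domains.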

 

Time Plan

Our estimate of the time and effort required to conduct this research is as follows:

Research: 80 hours, end of Oct
Implementation: 120 hours, 21st Dec
Analysis of the results: 60 hours, 15th Jan
Document & Presentation: 40 hours, Feb

 

References

[1] Waltl, B., Bonczek, G., Scepankova, E., & Matthes, F. (2018). Semantic Types of Legal Norms in German Laws: Classification and Analysis Using Local Linear Explanations. Artificial Intelligence and Law.

[2] Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.

[3] Sun, X., & He, J. (2018). A novel approach to generate a large scale of supervised data for short text sentiment analysis. Multimedia Tools and Applications, 1-21.

[4] Zhu, X. (2006). Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2(3), 4.

[5] Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78-94.

[6] Ravi, S., & Diao, Q. (2016, May). Large scale distributed semi-supervised learning using streaming approximation. In Artificial Intelligence and Statistics (pp. 519-528).

 

*The guided research will be carried out in cooperation with Allianz and the Chair of Software Engineering for Business Information Systems (sebis).

 
