Due to the growing number of engineering disciplines and the sharp increase in digital communication of their results, engineers today face an overwhelming amount of information when trying to solve their technical challenges. This problem can be addressed with new tools that, on the one hand, cope with the continuous growth of information and, on the other, understand the needs of the individual engineer. For tasks such as text classification, topic modeling, and search, these tools rely on advanced natural language processing (NLP) models and algorithms. Bidirectional Encoder Representations from Transformers (BERT) is a popular model that has achieved state-of-the-art performance on a number of natural language understanding tasks. One of its main advantages is its suitability for transfer learning: it can be trained on a large corpus of general texts and then fine-tuned for better performance on a smaller corpus of texts from a specific domain.

The goal of this thesis is to evaluate different approaches to training domain-specific BERT models for the engineering domain. Strategies such as training from scratch, fine-tuning a general model on domain-specific texts, training as part of a downstream task, and combinations thereof should be compared in terms of their performance on a set of predefined tasks. Special emphasis should be placed on how the approaches differ in their ability to cope with out-of-vocabulary words and in the ease of extending the vocabulary. If this proves beneficial and feasible, the existing datasets used for training and evaluation should be extended through manual labeling.
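To make the fine-tuning and vocabulary-extension strategies mentioned above more concrete, the following is a minimal sketch using the Hugging Face transformers library. The base model name, the example domain terms, and the tiny in-memory corpus are placeholders for illustration only and are not part of the thesis setup.

```python
# Minimal sketch: extend a general BERT vocabulary with domain terms and
# continue masked-language-model (MLM) pretraining on domain texts.
# Assumes the Hugging Face "transformers" library; the model name, the
# domain terms, and the tiny corpus below are placeholders.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Vocabulary extension: add engineering terms that the general tokenizer
# would otherwise split into several sub-word pieces.
new_terms = ["hydroforming", "tribology", "finite-element"]  # placeholder terms
num_added = tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized

# Stand-in domain corpus; in practice this would be a large collection of
# engineering documents.
texts = [
    "The finite-element simulation predicts stress concentrations.",
    "Hydroforming allows complex hollow parts to be produced.",
]
encodings = tokenizer(texts, truncation=True, max_length=128)
train_dataset = [
    {"input_ids": ids, "attention_mask": mask}
    for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])
]

# Dynamic masking of 15% of tokens, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-engineering",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```

Under the same assumptions, training from scratch would replace `from_pretrained` with a randomly initialized `BertForMaskedLM` built from a fresh configuration and a tokenizer trained on the domain corpus, while training as part of a downstream task would instead attach a task head (for example `BertForSequenceClassification`) and fine-tune on labeled data.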
Name | Type | Size | Last Modification | Last Editor |
---|---|---|---|---|
Final_SergiiPoluektov.pdf | PDF | 3.52 MB | 03.11.2021 | |
Kickoff_SergiiPoluektov.pdf | PDF | 2.34 MB | 03.11.2021 | |