
Master's Thesis Robin Geibel


Evaluating Approaches to Overcome the Input Size Limitations of Transformer-Based Language Models on Patent Documents

 

Abstract:

 

Large pretrained language models have made substantial progress in their ability to encode a text sequence’s syntactic and semantic information. Such language models thereby enable the construction of powerful machine learning models with little additional training data for downstream tasks ranging from question answering systems to text classification. The key to their pervasive success is the Transformer architecture they are based on. The Transformer and its self-attention mechanism allow the tokens of an input sequence to be processed in parallel. This parallelism enables the degree of pretraining necessary to achieve state-of-the-art results on downstream tasks. However, the Transformer’s memory and compute requirements grow quadratically with the length of the input sequence, which renders processing long sequences prohibitively expensive. The goal of this project is to examine a selection of the models created to overcome these limitations and to evaluate different aspects of their performance on downstream machine learning tasks. The BigPatent corpus, a collection of U.S. patent documents, is used to set up a benchmark task in which a variety of model configurations are tested on classifying patents according to their subject matter.
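
To make the quadratic scaling concrete, the sketch below (an illustrative NumPy toy, not code from the thesis; all names and dimensions are made up) computes scaled dot-product self-attention for a single sequence. The attention score matrix has shape (seq_len, seq_len), so memory and compute grow quadratically with the number of input tokens.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention for one sequence.

    x: (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # (seq_len, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len): quadratic in seq_len
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ v                               # (seq_len, d_head)

# Doubling seq_len quadruples the size of `scores`, which is what makes
# long inputs such as full patent documents expensive to process.
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 64, 16, 512               # illustrative sizes only
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (512, 16)
```

Long-input models such as those examined in the project typically replace the dense (seq_len, seq_len) score matrix with a sparse or otherwise restricted attention pattern to avoid this quadratic cost.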

 

Research Questions:

 

  1. Which methods and models for encoding long text sequences are most suited to downstream machine learning tasks?
  2. How does adapting a model’s attention mechanism to accommodate longer sequences affect performance on downstream machine learning tasks?
  3. Which classification model is most appropriate for patent subject matter classification?
