
Bachelor's Thesis Clemens Magg


Studying the Effectiveness of Longer Context Windows in LLMs for Summarization and Question Answering Tasks


Abstract


Earlier language models (LMs) were inadequate for long-text summarization because their limited context windows restricted their ability to process and comprehend extensive textual data. This limitation made it difficult for traditional LMs to capture critical long-range dependencies and contextual nuances in longer documents. Through improvements to model architecture and training practices, models with significantly longer context windows have since emerged. Large language models (LLMs) process large amounts of textual data effectively and achieve strong results across various NLP tasks. Yet despite their proficiency with extended context windows, the performance of LLMs on text summarization and question-answering tasks at context window sizes of up to 128k tokens remains largely unquantified, a significant research gap in the field of NLP.

To address this gap, we introduce new traceability metrics that evaluate how well LLMs utilize the information within their extended context windows for long-text summarization and question-answering tasks. We integrate these automated metrics into an evaluation framework that provides a nuanced understanding of model performance: our approach determines which information models use, and which they neglect, depending on the size of their context window.
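
To make this concrete, the following is a minimal Python sketch of one possible traceability-style metric; it is an illustration under our own assumptions, not the thesis's exact formulation. The source document is split into fixed-size token chunks, each summary sentence is aligned to the chunk it overlaps most, and the metric reports which chunks the summary traces back to. The chunk size and overlap threshold are illustrative defaults.

import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap(a, b):
    # Fraction of tokens in bag `a` that also occur in bag `b`.
    if not a:
        return 0.0
    return sum((a & b).values()) / sum(a.values())

def chunk_coverage(source, summary, chunk_tokens=256, threshold=0.5):
    # Split the source into fixed-size token chunks.
    tokens = tokenize(source)
    chunks = [Counter(tokens[i:i + chunk_tokens])
              for i in range(0, len(tokens), chunk_tokens)]
    if not chunks:
        return {"n_chunks": 0, "used_chunks": [], "coverage": 0.0}
    used = set()
    # Align each summary sentence to its best-matching chunk.
    for sent in re.split(r"(?<=[.!?])\s+", summary):
        bag = Counter(tokenize(sent))
        scores = [overlap(bag, chunk) for chunk in chunks]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            used.add(best)
    return {"n_chunks": len(chunks),
            "used_chunks": sorted(used),
            "coverage": len(used) / len(chunks)}

A low coverage value on a long document would then suggest that the model drew on only a small region of its context window.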

For our evaluation, we employ three datasets (InfiniteBench, BookSum, and XSum) that exhibit characteristics essential for evaluating extended context windows, such as a uniform distribution of salient information. We compare performance across four models: LLaMA 3.1, Qwen 2.5, Phi 3, and Command R7B.

Ultimately, our research reveals that the models exhibit inherent positional biases: their performance varies with both the location of information within the context window and the overall length of that window. These findings highlight the importance of understanding how context dynamics influence the efficacy of text summarization and question-answering tasks.
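
One way to surface such positional biases, in the spirit of common needle-in-a-haystack probes, is to embed a single salient fact at varying depths of an otherwise uniform context and check whether the model still recovers it. The sketch below assumes a generic generate(prompt) callable standing in for any of the evaluated models; the needle, question, and filler text are invented placeholders.

NEEDLE = "The archive key is 7391."
QUESTION = "What is the archive key?"
FILLER_SENTENCE = "Nothing of note happened on this day. "

def build_context(depth, total_sentences=2000):
    # Embed the needle at a relative depth in [0, 1] of the context.
    pos = int(depth * total_sentences)
    parts = ([FILLER_SENTENCE] * pos
             + [NEEDLE + " "]
             + [FILLER_SENTENCE] * (total_sentences - pos))
    return "".join(parts)

def probe(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for depth in depths:
        prompt = build_context(depth) + "\n\n" + QUESTION
        answer = generate(prompt)          # assumed model call
        results[depth] = "7391" in answer  # exact-match scoring
    return results

Plotting accuracy against depth, and repeating the probe at several total context lengths, would reveal whether performance degrades for information placed mid-context or far from the question.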

Keywords: Large Language Models, Benchmarking, Text Summarization, Question-Answering


Research Questions

  • RQ1: How can we adequately test the quality of text summarization by LLMs? Does the quality of the generated summary improve when more of the source document is provided? (One standard quality test is sketched after this list.)
  • RQ2: What are the most effective techniques for extending the context window of LLMs?
  • RQ3: How do LLMs use the information contained in their context? Do LLMs benefit from a long context window for text summarization? Can a model pay attention to all parts of the document, or is its attention clustered toward certain parts?
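
As a starting point for RQ1, generated summaries are commonly scored against reference summaries with ROUGE. The sketch below uses the open-source rouge-score package; the reference and candidate strings are placeholders, and ROUGE is one standard baseline rather than the full evaluation used in the thesis.

# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The committee approved the budget after a long debate."
candidate = "After lengthy debate, the committee approved the budget."

# ROUGE-1 measures unigram overlap; ROUGE-L measures the longest
# common subsequence between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")

ROUGE alone cannot tell which parts of the source a summary covers, which is what the traceability metrics described above are meant to add.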

Files and Subpages

Name                                           Size      Last Modification
250224 Magg Clemens BT Final.pdf               3.84 MB   09.05.2025
250414 Clemens Magg BT Final Presentation.pdf  1.61 MB   09.05.2025
Clemens Magg BT Kickoff.pdf                    1.57 MB   09.05.2025