Previous language models (LMs) have been inadequate for long-text summarization due to their limited context windows, which restrict their ability to process and comprehend extensive textual data. This limitation has made it difficult for traditional LMs to capture critical long-range dependencies and contextual nuances in longer documents. Through innovative improvements to model architecture and training practices, models with significantly longer context windows have emerged. Large language models (LLMs) process large amounts of textual data effectively and achieve outstanding results on a variety of NLP tasks. Despite their proficiency with extended context windows, however, quantifying LLM performance on text summarization and question-answering tasks remains largely unexplored for context window sizes of up to 128k tokens, a significant research gap in the field of NLP.
To address this gap, we introduce new traceability metrics that evaluate how well LLMs utilize the information within their extended context windows for long-text summarization and question-answering tasks. We propose an evaluation framework that integrates these automated traceability metrics to provide a nuanced understanding of model performance. Our approach focuses on determining which information the models use and which they neglect, depending on their context window size.
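To illustrate the general idea behind such a traceability metric, the following is a minimal sketch (not the exact formulation used in this work): each summary sentence is mapped back to the source chunk with which it shares the most tokens, yielding a record of which parts of the context window the model actually drew from. The `threshold` parameter and the token-overlap scoring are illustrative assumptions.

```python
# Minimal sketch of a traceability metric (illustrative, not the
# thesis's exact formulation): map each summary sentence to its
# best-matching source chunk and record which chunks were used.

from collections import Counter

def _tokens(text: str) -> Counter:
    return Counter(text.lower().split())

def overlap(sent: Counter, chunk: Counter) -> float:
    """Fraction of a summary sentence's tokens also present in a source chunk."""
    if not sent:
        return 0.0
    return sum((sent & chunk).values()) / sum(sent.values())

def trace_summary(summary_sentences, source_chunks, threshold=0.5):
    """Return, per source chunk, whether any summary sentence traces back to it."""
    used = [False] * len(source_chunks)
    chunk_tokens = [_tokens(c) for c in source_chunks]
    for sent in summary_sentences:
        sent_tokens = _tokens(sent)
        scores = [overlap(sent_tokens, ct) for ct in chunk_tokens]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            used[best] = True
    # Coverage of the context window: sum(used) / len(used)
    return used
```

A per-position aggregate of `used` over many documents then shows which regions of the context window the model systematically neglects.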
For our evaluation, we employ three datasets (InfiniteBench, BookSum, and XSum) that exhibit characteristics essential for testing extended context windows, such as a uniform distribution of salient information. We compare performance across four models: LLaMA 3.1, Qwen 2.5, Phi 3, and Command R7B.
Ultimately, our research reveals that the models exhibit inherent positional biases: their performance varies with both the location of information within the context window and the overall length of that window. These findings highlight the importance of understanding how context dynamics influence the efficacy of text summarization and question-answering tasks.
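Positional bias of this kind is typically measured with a needle-in-a-haystack-style probe. The sketch below is a hypothetical illustration of that procedure, not the evaluation code of this work; `ask_model` stands in for whichever LLM interface is under test.

```python
# Hedged sketch of a positional-bias probe: a salient fact ("needle")
# is inserted at varying relative depths inside a long filler context,
# and per-depth accuracy shows where in the window the model attends.

def build_context(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(len(filler) * depth)
    return filler[:pos] + " " + needle + " " + filler[pos:]

def positional_accuracy(ask_model, filler, needle, question, answer, depths):
    """Query the model once per depth and record whether it recovers the answer."""
    results = {}
    for d in depths:
        prompt = build_context(filler, needle, d) + "\n\n" + question
        results[d] = answer.lower() in ask_model(prompt).lower()
    return results  # e.g. {0.0: True, 0.5: False, 1.0: True}
```

Sweeping `depths` from 0.0 to 1.0 at several context lengths produces the kind of position-by-length performance profile from which positional biases can be read off.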
Keywords: Large Language Models, Benchmarking, Text Summarization, Question-Answering