Reducing Workload for Title and Abstract Screening in Medical
Systematic Reviews: A Comparison of ML and LLM Approaches
Abstract
Systematic reviews play a crucial role in evidence-based medicine, but selecting relevant studies is a major bottleneck. In particular, title and abstract screening (TIAB) requires researchers to manually evaluate thousands of publications, often taking several months and substantial financial resources.
With the rapid growth of the biomedical literature, there is an increasing need for automated solutions that support researchers in this task.
This master's thesis explores how natural language processing (NLP), machine learning (ML), and large language models (LLMs) can reduce the TIAB workload.
Three approaches are compared: (1) combining embeddings with classical ML classifiers, (2) fine-tuning transformer-based models, and (3) applying retrieval-augmented few-shot prompting with LLMs.
A labeled dataset from a medical domain serves as the basis for evaluation, which focuses on accuracy, efficiency, and practical applicability. The resulting benchmark reveals the strengths and limitations of each method and highlights how combining automation with human expertise can make systematic reviews faster, more reliable, and less resource-intensive.
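As an illustration of approach (1), the following minimal sketch embeds title/abstract records with a sentence-transformer and ranks unseen records with a logistic-regression classifier. The encoder name, the toy records, and their labels are illustrative placeholders, not the thesis dataset or the final pipeline.

```python
# Approach (1) sketch: sentence embeddings + a classical classifier.
# Assumes sentence-transformers and scikit-learn; all data below is toy data.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Each record: title and abstract concatenated, with a human include/exclude label.
train_texts = [
    "Deep learning for diabetic retinopathy screening ... (abstract text)",
    "A qualitative study of nurse shift scheduling ... (abstract text)",
]
train_labels = [1, 0]  # 1 = include, 0 = exclude

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
X_train = encoder.encode(train_texts)

# class_weight="balanced" matters: screening corpora are heavily skewed
# towards excluded records.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, train_labels)

# Rank unseen records by include probability rather than hard-classifying
# them, so reviewers can screen the most promising abstracts first.
new_texts = ["Randomized trial of AI-assisted triage ... (abstract text)"]
print(clf.predict_proba(encoder.encode(new_texts))[:, 1])
```

Ranking by include probability instead of emitting hard decisions lets reviewers screen the highest-scored records first and stop once the expected number of includes has been found, which is where most of the workload reduction comes from.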
Research Questions
- RQ1: Which approaches to ML-supported TIAB are described in the existing literature?
- RQ2: Can a pipeline of text embeddings and a classical ML classifier alone achieve reliable include/exclude decisions in systematic reviews (see the sketch above)?
- RQ3: Does end-to-end fine-tuning of a transformer meaningfully improve screening accuracy over baselines without fine-tuning? (See the fine-tuning sketch after this list.)
- RQ4: Can retrieval-augmented few-shot prompting with an LLM consistently deliver high-quality screening without any model fine-tuning? (See the prompting sketch after this list.)
- RQ5: How do the three proposed methods perform when benchmarked against one another and against the baseline? (A sketch of a common workload metric follows this list.)
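For RQ3, a minimal fine-tuning sketch using the Hugging Face Trainer is shown below. The generic bert-base-uncased checkpoint, the in-memory toy dataset, and the training settings are stand-ins (a biomedical checkpoint such as PubMedBERT could be substituted), not the configuration evaluated in the thesis.

```python
# RQ3 sketch: end-to-end fine-tuning of a transformer for include/exclude
# classification. Checkpoint, data, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

records = {
    "text": [
        "Deep learning for diabetic retinopathy screening ... (abstract text)",
        "A qualitative study of nurse shift scheduling ... (abstract text)",
    ],
    "label": [1, 0],  # 1 = include, 0 = exclude
}
dataset = Dataset.from_dict(records)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def tokenize(batch):
    # Titles plus abstracts rarely exceed 512 tokens, so truncation is cheap.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="screening-model", num_train_epochs=3),
    train_dataset=dataset.map(tokenize, batched=True),
)
trainer.train()
```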
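For RQ4, one possible shape of retrieval-augmented few-shot prompting is sketched below: the k labeled abstracts most similar to a new record are retrieved by embedding similarity and placed in the prompt as worked examples. The encoder, the gpt-4o-mini model name, the prompt wording, and the screen helper are all assumptions for illustration; any chat-completion LLM could take this role.

```python
# RQ4 sketch: retrieval-augmented few-shot prompting. Labeled neighbours of
# the new abstract are injected into the prompt as examples.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Tiny labeled pool standing in for already-screened records.
labeled_pool = [
    ("Deep learning for diabetic retinopathy screening ...", "INCLUDE"),
    ("A qualitative study of nurse shift scheduling ...", "EXCLUDE"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
pool_vecs = encoder.encode([t for t, _ in labeled_pool], normalize_embeddings=True)

def screen(abstract: str, k: int = 2) -> str:
    # Retrieve the k labeled examples closest to the new abstract.
    query = encoder.encode([abstract], normalize_embeddings=True)[0]
    top = np.argsort(pool_vecs @ query)[::-1][:k]
    shots = "\n\n".join(
        f"Abstract: {labeled_pool[i][0]}\nDecision: {labeled_pool[i][1]}"
        for i in top
    )
    prompt = (
        "You screen abstracts for a systematic review. Answer INCLUDE or "
        f"EXCLUDE only.\n\n{shots}\n\nAbstract: {abstract}\nDecision:"
    )
    response = OpenAI().chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(screen("Randomized trial of AI-assisted triage ..."))
```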
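For the benchmark in RQ5, plain accuracy is uninformative under the heavy class imbalance typical of screening data; a workload-oriented metric such as Work Saved over Sampling at 95% recall (WSS@95) is common in the screening-automation literature. The sketch below computes it from a model's ranking; the toy labels and scores are placeholders.

```python
# RQ5 sketch: WSS@95 = fraction of records a reviewer is spared, relative to
# random screening, when reading top-ranked records until 95% of the true
# includes have been found.
import numpy as np

def wss_at_recall(labels, scores, target_recall=0.95):
    order = np.argsort(scores)[::-1]            # screen highest-scored first
    ranked = np.asarray(labels)[order]
    found = np.cumsum(ranked)                   # includes found so far
    n_needed = int(np.ceil(target_recall * ranked.sum()))
    cutoff = int(np.argmax(found >= n_needed)) + 1  # records actually screened
    return 1 - cutoff / len(ranked) - (1 - target_recall)

labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # 1 = include
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3, 0.05, 0.4, 0.6, 0.15]
print(f"WSS@95 = {wss_at_recall(labels, scores):.2f}")  # 0.55 on this toy data
```

On this toy data, reaching 95% recall of the three includes requires reading only the top four of ten records, a saving of 55 percentage points over the roughly 95% of records that random screening would require.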