Development and evaluation of prompts for a large language model to screen titles and abstracts in a living systematic review

Open Access
Authors
  • Ava Homiar
  • James Thomas
  • Edoardo G. Ostinelli
  • Jaycee Kennett
  • Claire Friedrich
  • Pim Cuijpers
  • Mathias Harrer
  • Stefan Leucht
  • Clara Miguel
  • Alessandro Rodolico
  • Yuki Kataoka
  • Tomohiro Takayama
  • Keisuke Yoshimura
  • Ryuhei So
  • Yasushi Tsujimoto
  • Yosuke Yamagishi
  • Shiro Takagi
  • Masatsugu Sakata
  • Đorđe Bašić
  • Eirini Karyotaki
  • Jennifer Potts
  • Georgia Salanti
  • Toshi A. Furukawa
  • Andrea Cipriani
Publication date 01-2025
Journal BMJ Mental Health
Article number e301762
Volume | Issue number 28 | 1
Number of pages 8
Organisations
  • Faculty of Social and Behavioural Sciences (FMG) - Psychology Research Institute (PsyRes)
Abstract
Background  Living systematic reviews (LSRs) maintain an updated summary of evidence by incorporating newly published research. While they improve review currency, the repeated screening and selection of new references make them laborious and difficult to maintain. Large language models (LLMs) show promise in assisting with screening and data extraction, but more work is needed to achieve the high accuracy required for evidence that informs clinical and policy decisions.
Objective  The study evaluated the effectiveness of an LLM (GPT-4o) in title and abstract screening compared with human reviewers.
Methods  Human decisions from an LSR on prodopaminergic interventions for anhedonia served as the reference standard. The baseline search results were divided into a development and a test set. Prompts guiding the LLM's eligibility assessments were refined using the development set and evaluated on the test set and two subsequent LSR updates. Consistency of the LLM outputs was also assessed.
Results  Prompt development required 1045 records. When applied to the remaining 11 939 baseline records and two updates, the refined prompts achieved 100% sensitivity for studies ultimately included in the review after full-text screening, though sensitivity for records included by humans at the title and abstract stage varied (58-100%) across updates. Simulated workload reductions of 65-85% were observed. Prompt decisions showed high consistency, with minimal false exclusions, satisfying established screening performance benchmarks for systematic reviews.
Conclusions  Refined GPT-4o prompts demonstrated high sensitivity and moderate specificity while reducing human workload. This approach shows potential for integrating LLMs into systematic review workflows to enhance efficiency.
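The sensitivity and workload-reduction figures reported above can be illustrated with a minimal sketch. This is not the authors' code; the function name, toy data, and record counts are hypothetical, and it simply shows how such metrics are conventionally derived from screening decisions when human judgements serve as the reference standard.

```python
def screening_metrics(human_includes: set, llm_includes: set, total_records: int):
    """Compute screening sensitivity and simulated workload reduction.

    human_includes: record IDs humans included at title/abstract stage
                    (the reference standard).
    llm_includes:   record IDs the LLM prompt flagged as eligible.
    total_records:  total number of records screened.
    """
    true_positives = len(human_includes & llm_includes)
    false_negatives = len(human_includes - llm_includes)
    # Sensitivity: proportion of human-included records the LLM also retained.
    sensitivity = true_positives / (true_positives + false_negatives)
    # Workload reduction: share of records the LLM excludes, which humans
    # would then not need to screen manually.
    workload_reduction = 1 - len(llm_includes) / total_records
    return sensitivity, workload_reduction

# Hypothetical toy example: 20 records, humans include 3, LLM retains 5.
human = {1, 2, 3}
llm = {1, 2, 3, 4, 5}
sens, reduction = screening_metrics(human, llm, total_records=20)
```

In this toy case the LLM misses no human-included record (sensitivity 1.0) while excluding 15 of 20 records (75% workload reduction), mirroring the trade-off the study reports between sensitivity and specificity.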
Document type Article
Language English
Published at https://doi.org/10.1136/bmjment-2025-301762
Other links https://www.scopus.com/pages/publications/105011394073