Development and evaluation of prompts for a large language model to screen titles and abstracts in a living systematic review

Open Access
Authors
  • Ava Homiar
  • James Thomas
  • Edoardo G. Ostinelli
  • Jaycee Kennett
  • Claire Friedrich
  • Pim Cuijpers
  • Mathias Harrer
  • Stefan Leucht
  • Clara Miguel
  • Alessandro Rodolico
  • Yuki Kataoka
  • Tomohiro Takayama
  • Keisuke Yoshimura
  • Ryuhei So
  • Yasushi Tsujimoto
  • Yosuke Yamagishi
  • Shiro Takagi
  • Masatsugu Sakata
  • Đorđe Bašić
  • Eirini Karyotaki
  • Jennifer Potts
  • Georgia Salanti
  • Toshi A. Furukawa
  • Andrea Cipriani
Publication date 01-2025
Journal BMJ Mental Health
Article number e301762
Volume | Issue number 28 | 1
Number of pages 8
Organisations
  • Faculty of Social and Behavioural Sciences (FMG) - Psychology Research Institute (PsyRes)
Abstract
Background  Living systematic reviews (LSRs) maintain an updated summary of evidence by incorporating newly published research. While they improve review currency, the repeated screening and selection of new references make them laborious and difficult to maintain. Large language models (LLMs) show promise in assisting with screening and data extraction, but more work is needed to achieve the high accuracy required for evidence that informs clinical and policy decisions.
Objective  The study evaluated the effectiveness of an LLM (GPT-4o) in title and abstract screening compared with human reviewers.
Methods  Human decisions from an LSR on prodopaminergic interventions for anhedonia served as the reference standard. The baseline search results were divided into a development and a test set. Prompts guiding the LLM's eligibility assessments were refined using the development set and evaluated on the test set and two subsequent LSR updates. Consistency of the LLM outputs was also assessed.
Results  Prompt development required 1045 records. When applied to the remaining 11 939 baseline records and two updates, the refined prompts achieved 100% sensitivity for studies ultimately included in the review after full-text screening, though sensitivity for records included by humans at the title and abstract stage varied (58-100%) across updates. Simulated workload reductions of 65-85% were observed. Prompt decisions showed high consistency, with minimal false exclusions, satisfying established screening performance benchmarks for systematic reviews.
Conclusions  Refined GPT-4o prompts demonstrated high sensitivity and moderate specificity while reducing human workload. This approach shows potential for integrating LLMs into systematic review workflows to enhance efficiency.
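The sensitivity and workload-reduction figures reported above can be illustrated with a minimal sketch. This is not the authors' code; the function name, toy data, and record counts are hypothetical, and it simply shows how such metrics are conventionally derived from screening decisions when human judgements serve as the reference standard.

```python
def screening_metrics(human_includes: set, llm_includes: set, total_records: int):
    """Compute screening sensitivity and simulated workload reduction.

    human_includes: record IDs humans included at title/abstract stage
                    (the reference standard).
    llm_includes:   record IDs the LLM prompt flagged as eligible.
    total_records:  total number of records screened.
    """
    true_positives = len(human_includes & llm_includes)
    false_negatives = len(human_includes - llm_includes)
    # Sensitivity: proportion of human-included records the LLM also retained.
    sensitivity = true_positives / (true_positives + false_negatives)
    # Workload reduction: share of records the LLM excludes, which humans
    # would then not need to screen manually.
    workload_reduction = 1 - len(llm_includes) / total_records
    return sensitivity, workload_reduction

# Hypothetical toy example: 20 records, humans include 3, LLM retains 5.
human = {1, 2, 3}
llm = {1, 2, 3, 4, 5}
sens, reduction = screening_metrics(human, llm, total_records=20)
```

In this toy case the LLM misses no human-included record (sensitivity 1.0) while excluding 15 of 20 records (75% workload reduction), mirroring the trade-off the study reports between sensitivity and specificity.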
Document type Article
Language English
Published at https://doi.org/10.1136/bmjment-2025-301762
Other links https://www.scopus.com/pages/publications/105011394073