Improving the Reusability of Conversational Search Test Collections

Zahra Abbasiantaeb; Chuan Meng; Leif Azzopardi; Mohammad Aliannejadi

doi:https://doi.org/10.1007/978-3-031-88708-6_13

Authors	Zahra Abbasiantaeb; Chuan Meng; Leif Azzopardi; Mohammad Aliannejadi
Publication date	2025
Host editors	Claudia Hauff; Craig Macdonald; Dietmar Jannach; Gabriella Kazai; Franco Maria Nardini; Fabio Pinelli; Fabrizio Silvestri; Nicola Tonellotto
Book title	Advances in Information Retrieval
Book subtitle	47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025 : proceedings
ISBN	9783031887079
ISBN (electronic)	9783031887086
Series	Lecture Notes in Computer Science
Event	47th European Conference on Information Retrieval, ECIR 2025
Volume | Issue number	I
Pages (from-to)	196-213
Publisher	Cham: Springer
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Incomplete relevance judgments limit the reusability of test collections. When new systems are compared to previous systems that contributed to the pool, they are often at a disadvantage, because they return pockets of unjudged documents (called holes) in the test collection. The very nature of Conversational Search (CS) means that these holes are potentially larger and more problematic when evaluating systems. In this paper, we aim to extend CS test collections by employing Large Language Models (LLMs) to fill holes by leveraging existing judgments. We explore this problem using the TREC iKAT 23 and TREC CAsT 22 collections, where information needs are highly dynamic and the responses are much more varied, leaving bigger holes to fill. Our experiments reveal that CS collections show a trend towards less reusability in deeper turns. Also, fine-tuning the Llama 3.1 model leads to high agreement with human assessors, while few-shot prompting ChatGPT results in low agreement with humans. Consequently, filling the holes of a new system using ChatGPT leads to a larger change in the rank position of the new system. However, regenerating the assessment pool by few-shot prompting ChatGPT and using it for evaluation achieves a high rank correlation with human-assessed pools. We show that filling the holes using few-shot training of the Llama 3.1 model enables a fairer comparison between the new system and the systems that contributed to the pool. Our hole-filling model based on few-shot training of the Llama 3.1 model can improve the reusability of test collections.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1007/978-3-031-88708-6_13 (Final published version)
Downloads	Improving the Reusability of Conversational Search Test Collections (Final published version)

UvA-DARE

Digital Academic Repository
