Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

doi:https://doi.org/10.1145/3652583.3658032

Authors	H. Zhu, J.-H. Huang, S. Rudinac, E. Kanoulas
Publication date	2024
Book title	Proceedings of the 14th Annual ACM International Conference on Multimedia Retrieval (ICMR'24)
Book subtitle	Phuket, Thailand, June 10-14, 2024
ISBN (electronic)	9798400706196
Event	2024 International Conference on Multimedia Retrieval, ICMR 2024
Pages (from-to)	978-987
Number of pages	10
Publisher	New York, NY: Association for Computing Machinery
Organisations	Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI); Faculty of Law (FdR); Faculty of Science (FNWI) - Informatics Institute (IVI); Faculty of Science (FNWI)
Abstract	Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1145/3652583.3658032
Other links	https://www.scopus.com/pages/publications/85199178911
Downloads	3652583.3658032 (Final published version)
