Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
| Authors | |
|---|---|
| Publication date | 2024 |
| Book title | Proceedings of the 14th Annual ACM International Conference on Multimedia Retrieval (ICMR'24) |
| Book subtitle | Phuket, Thailand, June 10-14, 2024 |
| ISBN (electronic) |
|
| Event | 2024 International Conference on Multimedia Retrieval, ICMR 2024 |
| Pages (from-to) | 978-987 |
| Number of pages | 10 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations |
|
| Abstract | Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the most relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1145/3652583.3658032 |
| Other links | https://www.scopus.com/pages/publications/85199178911 |
| Downloads | 3652583.3658032 (Final published version) |
| Permalink to this page | |
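
The multi-turn loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap retriever, `toy_captioner`, `toy_denoiser`, the `INDEX` data, and the `RELEVANT` set are all hypothetical stand-ins for the text-image retriever, the VLM captioner, the LLM denoiser, and the user's relevance feedback. Skipping images already shown in earlier turns is likewise an assumption of this sketch, not something the abstract states.

```python
from typing import Callable, Dict, List, Set, Tuple

STOPWORDS = {"a", "an", "the", "and", "in", "on", "at", "with"}

def tokenize(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and drop a few stop-words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def retrieve(query: str, index: Dict[str, str], seen: Set[str], k: int) -> List[str]:
    """Toy retriever: rank unseen images by token overlap with the query."""
    q = tokenize(query)
    candidates = [img for img in index if img not in seen]
    candidates.sort(key=lambda img: (-len(q & tokenize(index[img])), img))
    return candidates[:k]

def interactive_search(
    query: str,
    index: Dict[str, str],
    captioner: Callable[[str], str],
    denoiser: Callable[[str, List[str]], str],
    is_relevant: Callable[[str], bool],
    k: int = 2,
    turns: int = 3,
) -> Tuple[List[str], str]:
    """Refine the query over several turns using relevance feedback."""
    found: List[str] = []
    seen: Set[str] = set()
    for _ in range(turns):
        results = retrieve(query, index, seen, k)
        seen.update(results)
        # Simulated user feedback: which of the shown images are relevant?
        relevant = [img for img in results if is_relevant(img)]
        found.extend(relevant)
        if not relevant:
            break  # no feedback to expand the query with
        # Caption the relevant images, then denoise the expanded query.
        captions = [captioner(img) for img in relevant]
        query = denoiser(query, captions)
    return found, query

# Hypothetical toy data: image id -> text used by the toy retriever.
INDEX = {
    "img1": "a man plays guitar on stage",
    "img2": "a dog runs in a park",
    "img3": "a band performs a concert with guitar and drums",
    "img4": "a chef cooks pasta",
    "img5": "drums and singer at a concert",
}
RELEVANT = {"img1", "img3", "img5"}

def toy_captioner(img: str) -> str:
    """Stand-in for a VLM captioner that adds some noisy words."""
    return INDEX[img] + " blurry watermark"

def toy_denoiser(query: str, captions: List[str]) -> str:
    """Stand-in for an LLM denoiser: merge, deduplicate, drop noise words."""
    noise = {"blurry", "watermark"}
    words: List[str] = []
    for w in (query + " " + " ".join(captions)).lower().split():
        if w not in noise and w not in words:
            words.append(w)
    return " ".join(words)

found, final_query = interactive_search(
    "guitar", INDEX, toy_captioner, toy_denoiser, lambda img: img in RELEVANT
)
print(found)
```

In this toy setting the initial query "guitar" cannot match `img5`, whose description never mentions a guitar. The captions of the first turn's relevant images add terms such as "concert" and "drums", which is the kind of vocabulary-mismatch repair the abstract attributes to query expansion.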
