Improving the Generalizability of the Dense Passage Retriever

Authors
Publication date 2023
Host editors
  • J. Kamps
  • L. Goeuriot
  • F. Crestani
  • M. Maistro
  • H. Joho
  • B. Davis
  • C. Gurrin
  • U. Kruschwitz
  • A. Caputo
Book title Advances in Information Retrieval
Book subtitle 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023 : proceedings
ISBN
  • 9783031282379
ISBN (electronic)
  • 9783031282386
Series Lecture Notes in Computer Science
Event 45th European Conference on Information Retrieval
Volume | Issue number II
Pages (from-to) 94-109
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Dense retrieval methods have surpassed traditional sparse retrieval methods for open-domain retrieval. While these methods, such as the Dense Passage Retriever (DPR), work well on the domains or datasets they have been trained on, there is a noticeable loss in accuracy when they are tested on out-of-distribution and out-of-domain datasets. We hypothesize that this may be, in large part, due to the mismatch in the information available to the context encoder and the query encoder during training. In most datasets commonly used for training dense retrieval models, the overwhelming majority of passages are paired with only one query. We hypothesize that this imbalance encourages dense retrieval models to overfit to a single potential query for a given passage, leading to worse performance on out-of-distribution and out-of-domain queries. To test this hypothesis, we focus on a prominent dense retrieval method, the Dense Passage Retriever, build generated datasets that have multiple queries for most passages, and compare DPR models trained on these datasets against models trained on single-query-per-passage datasets. Using the generated datasets, we show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets.
Document type Conference contribution
Language English
Published at https://doi.org/10.1007/978-3-031-28238-6_7