Reducing Redundancy with Anchor Text and Spam Priors

Open Access
Authors
Publication date 2012
Host editors
  • E.M. Voorhees
  • L.P. Buckland
Book title The Twentieth Text REtrieval Conference Proceedings (TREC 2011)
Series NIST Special Publication, SP 500-296
Event The Twentieth Text REtrieval Conference (TREC 2011)
Number of pages 6
Publisher Gaithersburg, MD: National Institute of Standards and Technology
Organisations
  • Faculty of Humanities (FGw)
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
In this paper, we document our efforts in participating in the TREC 2011 Web Tracks. We had multiple aims: This year, tougher topics were selected for the Web Track, for which there is less popularity information available. We look at the relative value of anchor text for these less popular topics, and at the impact of spam priors. Full-text retrieval on the ClueWeb09 B collection suffers from text spam, especially in the top 5 ranks. The spam prior largely reduces the impact of spam, leading to a boost in precision. We find that, in contrast to the more common queries of last year, anchor text does improve ad hoc retrieval performance of a full-text baseline for less common queries. However, for diversity, mixing anchor text and full-text leads to an improvement. Closer analysis reveals that when mixing anchor text and full-text, fewer relevant nuggets are retrieved, and these cover more subtopics. Anchor text is an effective way of reducing redundancy and increasing coverage of subtopics at the same time.
Document type Conference contribution
Language English
Related publication University of Amsterdam at the TREC 2011 Web track
Published at https://trec.nist.gov/pubs/trec20/papers/uamsterdam.web.update.pdf https://trec.nist.gov/pubs/trec20/t20.proceedings.html
Downloads
uamsterdam.web.update (Final published version)