Reducing Redundancy with Anchor Text and Spam Priors

M. Koolen; J. Kamps

Authors	M. Koolen; J. Kamps
Publication date	2012
Host editors	E.M. Voorhees; L.P. Buckland
Book title	The Twentieth Text REtrieval Conference Proceedings (TREC 2011)
Series	NIST Special Publication, SP 500-296
Event	The Twentieth Text REtrieval Conference (TREC 2011)
Number of pages	6
Publisher	Gaithersburg, MD: National Institute of Standards and Technology
Organisations	Faculty of Humanities (FGw) Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	In this paper, we document our participation in the TREC 2011 Web Track. We had multiple aims. This year, tougher topics were selected for the Web Track, for which less popularity information is available. We look at the relative value of anchor text for these less popular topics, and at the impact of spam priors. Full-text retrieval on the ClueWeb09 B collection suffers from text spam, especially in the top 5 ranks. The spam prior largely reduces the impact of spam, leading to a boost in precision. We find that, in contrast to the more common queries of last year, anchor text does not improve ad hoc retrieval performance of a full-text baseline for less common queries. However, for diversity, mixing anchor text and full-text leads to an improvement. Closer analysis reveals that, when mixing anchor text and full-text, fewer relevant nuggets are retrieved, but they cover more subtopics. Anchor text is thus an effective way of reducing redundancy and increasing coverage of subtopics at the same time.
Document type	Conference contribution
Language	English
Related publication	University of Amsterdam at the TREC 2011 Web track
Published at	https://trec.nist.gov/pubs/trec20/papers/uamsterdam.web.update.pdf (Final published version) https://trec.nist.gov/pubs/trec20/t20.proceedings.html (Final published version)
Downloads	uamsterdam.web.update (Final published version)