Reducing Redundancy with Anchor Text and Spam Priors
| Authors | |
|---|---|
| Publication date | 2012 |
| Host editors |
|
| Book title | The Twentieth Text REtrieval Conference Proceedings (TREC 2011) |
| Series | NIST Special Publication, SP 500-296 |
| Event | The Twentieth Text REtrieval Conference (TREC 2011) |
| Number of pages | 6 |
| Publisher | Gaithersburg, MD: National Institute of Standards and Technology |
| Organisations |
|
| Abstract |
In this paper, we document our participation in the TREC 2011 Web Track. We had multiple aims. This year, tougher topics were selected for the Web Track, for which less popularity information is available. We look at the relative value of anchor text for these less popular topics, and at the impact of spam priors. Full-text retrieval on the ClueWeb09 B collection suffers from text spam, especially in the top 5 ranks. The spam prior largely reduces the impact of spam, leading to a boost in precision. We find that, in contrast to the more common queries of last year, anchor text does not improve the ad hoc retrieval performance of a full-text baseline for less common queries. However, for diversity, mixing anchor text and full-text leads to an improvement. Closer analysis reveals that, when mixing anchor text and full-text, fewer relevant nuggets are retrieved, but they cover more subtopics. Anchor text is thus an effective way of reducing redundancy and increasing coverage of subtopics at the same time.
|
| Document type | Conference contribution |
| Language | English |
| Related publication | University of Amsterdam at the TREC 2011 Web track |
| Published at | https://trec.nist.gov/pubs/trec20/papers/uamsterdam.web.update.pdf https://trec.nist.gov/pubs/trec20/t20.proceedings.html |
| Downloads |
uamsterdam.web.update
(Final published version)
|
