Parsing with subdomain instance weighting from raw corpora

B. Plank; K. Sima'an

Authors	B. Plank; K. Sima'an
Publication date	2008
Journal	Interspeech
Event	9th Annual Conference of the International Speech Communication Association (Interspeech 2008), Brisbane, Australia
Volume | Issue number	9
Pages (from-to)	2540
Organisations	Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	The treebanks that are used for training statistical parsers consist of hand-parsed sentences from a single source/domain like newspaper text. However, newspaper text concerns different subdomains of language use (e.g. finance, sports, politics, music), which implies that the statistics gathered by generative statistical parsers are averages over subdomain statistics. In this paper we explore a method, subdomain instance-weighting, that exploits raw subdomain corpora for introducing subdomain statistics into a state-of-the-art generative parser. We employ instance-weighting for creating an ensemble of subdomain specific versions of the parser, and explore methods for amalgamating their predictions. Our experiments show that subdomain statistics extracted from raw corpora can even improve the quality of the n-best lists of a formidable, state-of-the-art parser.
Document type	Article
Note	Proceedings title: Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech 2008): incorporating the 12th Australian International Conference on Speech Science and Technology (SST 2008): 22-26 September 2008, Brisbane, Australia. Publisher: International Speech Communication Association.
Language	English
Published at	https://www.isca-speech.org/archive/interspeech_2008/i08_2540.html (Final published version); http://www.let.rug.nl/~bplank/papers/interspeech2008.pdf