Subdomain sensitive statistical parsing using raw corpora

Authors	B. Plank, K. Sima'an
Publication date	2008
Book title	LREC 2008: Sixth International Conference on Language Resources and Evaluation: Proceedings
Event	Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco
Pages (from-to)	465-469
Publisher	European Language Resources Association (ELRA)
Organisations	Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract	Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques for creating an ensemble of domain sensitive parsers, and explore methods for amalgamating their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a tough, state-of-the-art baseline.
Document type	Conference contribution
Published at	http://www.lrec-conf.org/proceedings/lrec2008/pdf/120_paper.pdf
Downloads	299337.pdf