Learning structural dependencies of words in the Zipfian Tail

Authors
Publication date 04-2014
Journal Journal of Logic and Computation
Volume | Issue number 24 | 2
Pages (from-to) 433-453
Number of pages 21
Organisations
  • Faculty of Science (FNWI)
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract

This article uses semi-supervised Expectation Maximization (EM) to learn lexico-syntactic dependencies, i.e. associations between words and the structures that occur with them. Due to Zipfian distributions in language, such dependencies are extremely sparse in labelled data, and unlabelled data are the only source for learning them. Specifically, we learn sparse lexical parameters of a generative parsing model (a Probabilistic Context-Free Grammar, PCFG) that is initially estimated over the Penn Treebank. Our lexical parameters are similar to supertags - they are fine-grained, and encode complex structural information at the pre-terminal level. Our goal is to use unlabelled data to learn these for words that are rare or unseen in the labelled data. We get large error reductions (up to 17.5%) in parsing ambiguous structures associated with unseen verbs, the most important case of learning lexico-structural dependencies, resulting in a statistically significant improvement in labelled bracketing score of the treebank PCFG. Our semi-supervised method incorporates structural and lexical priors from the labelled data to guide estimation from unlabelled data, and is the first successful use of semi-supervised EM to improve a generative structured model already trained over large labelled data. The method scales well to larger amounts of unlabelled data, and also gives substantial error reductions (up to 11.5%) for models trained on smaller amounts of labelled data, making it relevant to low-resource languages with small treebanks as well.

Document type Article
Language English
Published at https://doi.org/10.1093/logcom/exs062
Other links https://www.scopus.com/pages/publications/84896997749
Permalink to this page
Back