On the statistical consistency of DOP estimators

Open Access
Authors
Publication date 2004
Host editors
  • B. Decadt
  • V. Hoste
  • G. De Pauw
Book title Proceedings of the 14th Meeting of Computational Linguistics in the Netherlands (CLIN 2003)
Event 14th Meeting of Computational Linguistics in The Netherlands (CLIN 2003), Antwerp, Belgium
Pages (from-to) 63-77
Publisher Antwerp: University of Antwerp
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
A statistical estimator attempts to guess an unknown probability distribution by analyzing a sample from this distribution. One desirable property of an estimator is that its guess is increasingly likely to get arbitrarily close to the actual distribution as the sample size increases. This property is called consistency.
Data Oriented Parsing (DOP) employs all fragments of the trees in a training treebank, including the full parse-trees themselves, as the rewrite rules of a probabilistic tree-substitution grammar. Since the most popular DOP-estimator (DOP1) was shown to be inconsistent, there is an outstanding theoretical question concerning the possibility of DOP-estimators with reasonable statistical properties. This question constitutes the topic of the current paper.
First, we show that, contrary to common wisdom, any unbiased estimator for DOP is futile because it will not generalize over the training treebank. Subsequently, we show that a consistent estimator that generalizes over the treebank should involve a local smoothing technique. This exposes the relation between DOP and existing memory-based models that work with full memory and an analogical function such as k-nearest neighbor, which is known to implement backoff smoothing.
Finally, we present a new consistent backoff-based estimator for DOP and discuss how it combines the memory-based preference for the longest match with the probabilistic preference for the most frequent match.
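As an illustration of the consistency property defined in the abstract, the following is a minimal sketch and not taken from the paper. It assumes a toy three-outcome distribution and a plain relative-frequency estimator rather than any DOP estimator. The names `relative_frequency_estimate`, `max_abs_error` and `true_dist` are hypothetical. The sketch prints the largest gap between estimated and true probabilities for increasing sample sizes; for a consistent estimator this gap is increasingly likely to be small as the sample grows.

```python
import random
from collections import Counter


def relative_frequency_estimate(sample):
    """Estimate a distribution by the relative frequency of each outcome."""
    counts = Counter(sample)
    n = len(sample)
    return {outcome: c / n for outcome, c in counts.items()}


def max_abs_error(estimate, true_dist):
    """Largest absolute gap between estimated and true probabilities."""
    return max(abs(estimate.get(o, 0.0) - p) for o, p in true_dist.items())


# Toy distribution, assumed for illustration only.
true_dist = {"a": 0.5, "b": 0.3, "c": 0.2}
outcomes = list(true_dist)
weights = [true_dist[o] for o in outcomes]

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples

errors = {}
for n in (10, 1000, 100000):
    sample = rng.choices(outcomes, weights=weights, k=n)
    errors[n] = max_abs_error(relative_frequency_estimate(sample), true_dist)
    print(n, round(errors[n], 4))
```

A single run can only suggest the trend. Consistency is a statement about the probability of a large error vanishing as the sample size goes to infinity, not about any one sample.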
Document type Conference contribution
Language English
Published at http://www.cnts.ua.ac.be/clin2003/proc/07Prescher.pdf
Downloads
225455.pdf (Final published version)