Report Number: CS-TR-94-1502
Institution: Stanford University, Department of Computer Science
Title: Natural Language Parsing as Statistical Pattern Recognition
Author: Magerman, David M.
Date: February 1994
Abstract: Traditional natural language parsers are based on rewrite
rule systems developed in an arduous, time-consuming manner
by grammarians. Most of the grammarian's effort is devoted to
the disambiguation process: first hypothesizing rules that
dictate constituent categories and relationships among words
in ambiguous sentences, and then seeking exceptions and
corrections to these rules.
In this work, I propose an automatic method for acquiring a
statistical parser from a set of parsed sentences. The method
takes advantage of some initial linguistic input but avoids
the pitfalls of the iterative and seemingly endless grammar
development process. Using distributionally derived and
linguistically based features of language, the parser
acquires a set of statistical decision trees that assign a
probability distribution over the space of parse trees given
the input sentence. By basing the disambiguation criteria
selection on entropy reduction rather than human intuition,
this parser development method is able to consider more
sentences than a human grammarian can when making individual
disambiguation rules.
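
As an illustration of selecting disambiguation criteria by entropy
reduction, the following is a minimal sketch and not the parser
described in the report. It assumes yes/no questions over a feature
dictionary and picks the question that most reduces the entropy of
the labels. The feature names (prev_tag, next_tag), the labels, the
toy data, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a non-empty list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_reduction(examples, question):
    """Drop in label entropy after splitting examples on a yes/no question.

    examples: list of (features, label) pairs.
    question: function mapping features to True or False.
    """
    labels = [label for _, label in examples]
    yes = [label for feats, label in examples if question(feats)]
    no = [label for feats, label in examples if not question(feats)]
    if not yes or not no:
        # The question does not split the data, so nothing is gained.
        return 0.0
    n = len(examples)
    conditional = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - conditional


def best_question(examples, questions):
    """Name of the candidate question with the largest entropy reduction."""
    return max(questions, key=lambda name: entropy_reduction(examples, questions[name]))


# Hypothetical toy data: context features paired with a constituent label.
examples = [
    ({"prev_tag": "DT", "next_tag": "NN"}, "NP"),
    ({"prev_tag": "DT", "next_tag": "VB"}, "NP"),
    ({"prev_tag": "MD", "next_tag": "NN"}, "VP"),
    ({"prev_tag": "MD", "next_tag": "VB"}, "VP"),
]

# Hypothetical candidate questions about the context.
questions = {
    "prev_is_DT": lambda f: f["prev_tag"] == "DT",
    "next_is_NN": lambda f: f["next_tag"] == "NN",
}

chosen = best_question(examples, questions)
```

In this toy data, prev_is_DT separates the two labels completely,
while next_is_NN leaves each half evenly mixed, so the first question
is the one selected. A decision tree learner would typically apply
the same choice recursively to each resulting subset.
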
In experiments, the decision tree parser significantly
outperforms a grammarian's rule-based parser, achieving an
accuracy rate of 78% compared to the rule-based parser's 69%.
http://i.stanford.edu/pub/cstr/reports/cs/tr/94/1502/CS-TR-94-1502.pdf