
A foundation for general-purpose natural language generation: Sentence realization using probabilistic models of language

Posted on: 2004-02-01
Degree: Ph.D.
Type: Thesis
University: University of Southern California
Candidate: Langkilde-Geary, Irene
Full Text: PDF
GTID: 2465390011975207
Subject: Computer Science
Abstract/Summary:
Natural language generation (NLG) is the task of formulating a fluent sequence of words in natural language to communicate information or ideas in applications like machine translation, human-computer dialogue, automatic summarization, and question answering. Realization, a fundamental subtask of NLG, produces an individual sentence from a sentence plan specified in terms of linguistic relations between words and/or concepts. It involves determining the order of words, inserting function words like determiners and prepositions, performing morphological inflections, and ensuring grammaticality and agreement.

An ultimate goal for natural language generation is to develop a large-scale, robust, general-purpose system. Two primary challenges are scaling up to broad coverage of syntax and producing high-quality output. The irregularity of natural language makes it difficult to know how to combine linguistic primitives into fluent sentences. Also, the knowledge resources for making such a determination are time-consuming and labor-intensive to assemble, leading to a knowledge acquisition bottleneck. Evaluating whether a realizer performed appropriately is an additional challenge: there can often be more than one acceptable output, and no tools exist that can automatically assess grammaticality or fluency.

This thesis takes the approach of using probabilistic models learned from text corpora to rank candidate sentences and output the most likely one. It contributes (1) a symbolic mapping rule formalism and ruleset for mapping inputs to candidate outputs that achieves broad coverage through greater regularity, (2) a packed forest representation and efficient ranking algorithm that can manage the combinatorial growth in output candidates, and (3) an empirical evaluation of coverage, correctness, and the ability to handle underspecification. This evaluation is the first large-scale empirical evaluation of coverage and quality ever performed for sentence realization.

The empirical evaluation is performed by automatically converting a set of 2400 hand-parsed sentences from the Penn Treebank corpus into system inputs, and then regenerating them using the system. The top-ranked output of the generator is compared to the original sentence. The results show better than 80% coverage of newspaper text and 94% precision (57% exact matches) for almost fully specified inputs, and the same coverage with 55% precision for minimally specified inputs.
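The core idea of the approach — score candidate realizations with a probabilistic model learned from corpora and emit the most likely — can be illustrated with a toy smoothed bigram language model. Everything below (the three-sentence corpus, the add-one smoothing, the candidate list) is an invented illustration, not the thesis's actual model, training data, or candidate generator:

```python
from collections import defaultdict
from math import log

# Toy corpus standing in for the large text corpora a real system trains on.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat slept on the mat".split(),
]

# Count bigrams, padding each sentence with boundary markers.
bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
vocab = set()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    vocab.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1
        context_counts[prev] += 1

def sentence_logprob(words):
    """Log-probability of a candidate under the add-one-smoothed bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        lp += log((bigram_counts[prev][cur] + 1) /
                  (context_counts[prev] + len(vocab)))
    return lp

# Hypothetical candidate realizations of one sentence plan, e.g. alternative
# word orders and word choices produced by symbolic mapping rules.
candidates = [
    "the cat sat on the mat".split(),
    "cat the sat mat the on".split(),
    "the cat slept on the rug".split(),
]
best = max(candidates, key=sentence_logprob)
print(" ".join(best))  # → the cat sat on the mat
```

The scrambled candidate is penalized because its bigrams were never observed; the fluent candidate built entirely from seen bigrams wins.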
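Contribution (2) addresses the combinatorial growth in candidates by packing shared substructure so that ranking never enumerates each sentence separately. A minimal sketch of that idea, using a hypothetical word lattice (a simple special case of a packed forest) and a Viterbi-style dynamic program over (node, last word) states — the lattice, scores, and names are invented and far simpler than the thesis's representation:

```python
from math import log

# Hypothetical word lattice: node -> list of (word, next_node) edges.
# All candidates for one sentence plan share structure, so search cost is
# proportional to the number of edges, not the number of sentences.
lattice = {
    0: [("the", 1), ("a", 1)],
    1: [("cat", 2), ("cats", 2)],
    2: [("sleeps", 3), ("sleep", 3)],
    3: [],  # final node
}

# Toy bigram log-probabilities; a real system learns these from corpora.
score = {
    ("<s>", "the"): log(0.6), ("<s>", "a"): log(0.4),
    ("the", "cat"): log(0.5), ("the", "cats"): log(0.5),
    ("a", "cat"): log(0.7), ("a", "cats"): log(0.3),
    ("cat", "sleeps"): log(0.8), ("cat", "sleep"): log(0.2),
    ("cats", "sleeps"): log(0.1), ("cats", "sleep"): log(0.9),
}

def best_path(lattice, start=0, final=3):
    """Viterbi-style search: keep the best hypothesis per (node, last word)."""
    best = {(start, "<s>"): (0.0, [])}
    for node in sorted(lattice):  # nodes are numbered in topological order
        for (n, prev), (lp, words) in list(best.items()):
            if n != node:
                continue
            for word, nxt in lattice[node]:
                cand = (lp + score[(prev, word)], words + [word])
                key = (nxt, word)
                if key not in best or cand[0] > best[key][0]:
                    best[key] = cand
    # Highest-scoring hypothesis among states at the final node.
    return max(v for (n, _), v in best.items() if n == final)[1]

print(" ".join(best_path(lattice)))  # → the cats sleep
```

Note that the state must include the last emitted word, not just the lattice node: "the cat" and "the cats" tie in score at node 2, but only the plural licenses the higher-scoring continuation "sleep".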
Keywords/Search Tags: Natural language, Language generation, Sentence, Coverage, Using, Realization, Words, Inputs