
A foundation for general-purpose natural language generation: Sentence realization using probabilistic models of language

Posted on: 2004-02-01
Degree: Ph.D.
Type: Thesis
University: University of Southern California
Candidate: Langkilde-Geary, Irene
Full Text: PDF
GTID: 2465390011975207
Subject: Computer Science
Abstract/Summary:
Natural language generation (NLG) is the task of formulating a fluent sequence of words in natural language to communicate information or ideas in applications like machine translation, human-computer dialogue, automatic summarization, and question answering. Realization, a fundamental subtask of NLG, produces an individual sentence from a sentence plan specified in terms of linguistic relations between words and/or concepts. It involves determining the order of words, inserting function words like determiners and prepositions, performing morphological inflections, and ensuring grammaticality and agreement.

An ultimate goal for natural language generation is to develop a large-scale, robust, general-purpose system. Two primary challenges are scaling up to broad coverage of syntax and producing high-quality output. The irregularity of natural language makes it difficult to know how to combine linguistic primitives into fluent sentences. Also, the knowledge resources for making such a determination are time-consuming and labor-intensive to assemble, leading to a knowledge acquisition bottleneck. Evaluating whether a realizer performed appropriately is an additional challenge: there can often be more than one acceptable output, and no tools exist that can automatically assess grammaticality or fluency.

This thesis takes the approach of using probabilistic models learned from text corpora to rank candidate sentences and output the most likely one. It contributes (1) a symbolic mapping rule formalism and ruleset for mapping inputs to candidate outputs that achieves broad coverage through greater regularity, (2) a packed forest representation and efficient ranking algorithm that can manage the combinatorial growth in output candidates, and (3) an empirical evaluation of coverage, correctness, and the ability to handle underspecification. This evaluation is the first large-scale empirical evaluation of coverage and quality ever performed for sentence realization.

The empirical evaluation is performed by automatically converting a set of 2400 hand-parsed sentences from the Penn Treebank corpus into system inputs, and then regenerating them using the system. The top-ranked output of the generator is compared to the original sentence. The results show better than 80% coverage of newspaper text and 94% precision (57% exact matches) for almost fully specified inputs, and the same coverage with 55% precision for minimally specified inputs.
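The core idea of the approach — score candidate realizations with a probabilistic model learned from corpora and emit the most likely — can be illustrated with a toy smoothed bigram language model. Everything below (the three-sentence corpus, the add-one smoothing, the candidate list) is an invented illustration, not the thesis's actual model, training data, or candidate generator:

```python
from collections import defaultdict
from math import log

# Toy corpus standing in for the large text corpora a real system trains on.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat slept on the mat".split(),
]

# Count bigrams, padding each sentence with boundary markers.
bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
vocab = set()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    vocab.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1
        context_counts[prev] += 1

def sentence_logprob(words):
    """Log-probability of a candidate under the add-one-smoothed bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        lp += log((bigram_counts[prev][cur] + 1) /
                  (context_counts[prev] + len(vocab)))
    return lp

# Hypothetical candidate realizations of one sentence plan, e.g. alternative
# word orders and word choices produced by symbolic mapping rules.
candidates = [
    "the cat sat on the mat".split(),
    "cat the sat mat the on".split(),
    "the cat slept on the rug".split(),
]
best = max(candidates, key=sentence_logprob)
print(" ".join(best))  # → the cat sat on the mat
```

The scrambled candidate is penalized because its bigrams were never observed; the fluent candidate built entirely from seen bigrams wins.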
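Contribution (2) addresses the combinatorial growth in candidates by packing shared substructure so that ranking never enumerates each sentence separately. A minimal sketch of that idea, using a hypothetical word lattice (a simple special case of a packed forest) and a Viterbi-style dynamic program over (node, last word) states — the lattice, scores, and names are invented and far simpler than the thesis's representation:

```python
from math import log

# Hypothetical word lattice: node -> list of (word, next_node) edges.
# All candidates for one sentence plan share structure, so search cost is
# proportional to the number of edges, not the number of sentences.
lattice = {
    0: [("the", 1), ("a", 1)],
    1: [("cat", 2), ("cats", 2)],
    2: [("sleeps", 3), ("sleep", 3)],
    3: [],  # final node
}

# Toy bigram log-probabilities; a real system learns these from corpora.
score = {
    ("<s>", "the"): log(0.6), ("<s>", "a"): log(0.4),
    ("the", "cat"): log(0.5), ("the", "cats"): log(0.5),
    ("a", "cat"): log(0.7), ("a", "cats"): log(0.3),
    ("cat", "sleeps"): log(0.8), ("cat", "sleep"): log(0.2),
    ("cats", "sleeps"): log(0.1), ("cats", "sleep"): log(0.9),
}

def best_path(lattice, start=0, final=3):
    """Viterbi-style search: keep the best hypothesis per (node, last word)."""
    best = {(start, "<s>"): (0.0, [])}
    for node in sorted(lattice):  # nodes are numbered in topological order
        for (n, prev), (lp, words) in list(best.items()):
            if n != node:
                continue
            for word, nxt in lattice[node]:
                cand = (lp + score[(prev, word)], words + [word])
                key = (nxt, word)
                if key not in best or cand[0] > best[key][0]:
                    best[key] = cand
    # Highest-scoring hypothesis among states at the final node.
    return max(v for (n, _), v in best.items() if n == final)[1]

print(" ".join(best_path(lattice)))  # → the cats sleep
```

Note that the state must include the last emitted word, not just the lattice node: "the cat" and "the cats" tie in score at node 2, but only the plural licenses the higher-scoring continuation "sleep".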
Keywords/Search Tags: Natural language, Language generation, Sentence, Coverage, Using, Realization, Words, Inputs