Extraction d'information à partir de transcriptions de conversations téléphoniques spécialisées (Information Extraction from Transcriptions of Specialized Telephone Conversations)

Posted on: 2006-11-27
Degree: Ph.D.
Type: Thesis
University: Université de Montréal (Canada)
Candidate: Boufaden, Narjes
GTID: 2458390005998081
Subject: Language
Abstract/Summary:
Information extraction (IE) seeks instances of event classes and relations and extracts their arguments from text within a particular domain. Standard approaches rely heavily on syntactic processing to determine the thematic roles of relevant information; semantic processing is minimal and restricted to named entity extraction. Applying standard approaches to specialized conversational texts raises two problems, one related to the text's rhetorical structure and the other to specialization.

Scattered information and disfluencies such as repetitions and omissions are examples of the difficulties of IE from conversational texts. Question-answer pairs, widely used in conversations, are a case where pieces of information are conveyed through successive utterances. In addition, the spontaneous character of composition lowers the information density of individual utterances while increasing their number. Information is often spread over several utterances by means of pronominal anaphora.

Edited words, omissions and interruptions are examples of disfluencies that alter the utterance structure, causing a significant drop in part-of-speech tagging and parsing performance. Furthermore, altering the syntactic structure of utterances makes syntax-driven learning of extraction patterns difficult, if not impossible.

Specialized texts, on the other hand, are characterized by a sub-language with a specialized vocabulary that refers to relevant domain concepts. These require a semantic tagging process that goes beyond named entity extraction.

In this thesis, we show that standard approaches are unsuitable for specialized conversational texts. We propose a new IE approach that takes into account the characteristics of these texts. Our main claim is that a word-meaning-based approach is more suitable for them than a context-based approach.

The core component of our approach is a robust semantic tagger based on a statistical model that labels relevant pieces of information with concepts drawn from a domain ontology. Word labels are used to learn predicate-argument relations relevant to the domain.

In addition to the robust semantic tagger, our five-step approach includes a linguistic segmentation stage that defines units suitable for the pattern learning process. Topic segmentation identifies topically coherent units that help anaphora resolution. Anaphora resolution in turn helps to identify more relevant relations hidden by the pronominalization of the topic. This stage precedes the pattern learning stage, which is based on Markov models that include wildcard states designed to handle edited words and null transitions to handle omissions.

We tested our approach on manually transcribed telephone conversations in the domain of maritime search and rescue, and succeeded in extracting individual facts with an F-score of 78.9%.
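
To make the role of wildcard states and null transitions more concrete, the following is a minimal Python sketch, not the model used in the thesis. It is a deterministic stand-in for the statistical Markov model: it aligns a sequence of semantic tags against a left-to-right pattern, letting a wildcard state absorb a bounded number of extra tokens (such as edited words) and letting a null transition skip a state marked as optional (an omitted argument). The concept labels (`VESSEL`, `STATUS`, `LOCATION`), the `"O"` tag for unlabeled words, the `align` function and the `max_wild` parameter are all hypothetical names chosen for illustration.

```python
from functools import lru_cache

WILD = "*"  # hypothetical marker for a token absorbed by a wildcard state


def align(pattern, tags, max_wild=2):
    """Align semantic tags against a left-to-right extraction pattern.

    pattern: tuple of (concept, optional) pairs; optional=True means the
             state may be skipped through a null transition.
    tags:    tuple of concept labels, one per word ("O" = no concept).
    Returns a tuple of (state, token_index) pairs, or None if no alignment
    exists. A token_index of None marks a state skipped by a null transition.
    """
    pattern, tags = tuple(pattern), tuple(tags)

    @lru_cache(maxsize=None)
    def go(i, j, wild_used):
        # i: next pattern state, j: next token,
        # wild_used: tokens absorbed by the wildcard since the last emission
        if i == len(pattern):
            # Remaining tokens may only be absorbed by the wildcard state.
            if len(tags) - j <= max_wild - wild_used:
                return tuple((WILD, k) for k in range(j, len(tags)))
            return None
        concept, optional = pattern[i]
        if j < len(tags):
            # Emission: the token carries the concept this state expects.
            if tags[j] == concept:
                rest = go(i + 1, j + 1, 0)
                if rest is not None:
                    return ((concept, j),) + rest
            # Wildcard state: absorb one token, e.g. an edited word.
            if wild_used < max_wild:
                rest = go(i, j + 1, wild_used + 1)
                if rest is not None:
                    return ((WILD, j),) + rest
        # Null transition: skip an omitted, optional argument.
        if optional:
            rest = go(i + 1, j, wild_used)
            if rest is not None:
                return ((concept, None),) + rest
        return None

    return go(0, 0, 0)


# Hypothetical pattern: a vessel, its status, and an optional location.
PATTERN = (("VESSEL", False), ("STATUS", False), ("LOCATION", True))
result = align(PATTERN, ("VESSEL", "O", "O", "STATUS"))
```

In this example the two `"O"` tokens are absorbed by the wildcard state and the missing `LOCATION` argument is skipped by a null transition. A statistical version would attach probabilities to emissions and transitions and pick the most likely path rather than the first valid one.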
Keywords/Search Tags:Information, Extraction, Conversations, Specialized conversational texts, Domain