Extraction d'information à partir de transcriptions de conversations téléphoniques spécialisées

Posted on:2006-11-27

Degree:Ph.D

Type:Thesis

University:Université de Montréal (Canada)

Candidate:Boufaden, Narjes

GTID:2458390005998081

Subject:Language

Abstract/Summary:

Information extraction (IE) seeks instances of event classes and relations in text from a particular domain and extracts their arguments. Standard approaches rely heavily on syntactic processing to determine the thematic roles of relevant information; semantic processing is minimal and restricted to named entity extraction. Applying these approaches to specialized conversational texts raises two problems, one stemming from the rhetorical structure of the text and the other from its specialization.

Scattered information and disfluencies such as repetitions and omissions are typical difficulties in IE from conversational texts. In question-answer pairs, which are common in conversation, pieces of information are conveyed across successive utterances. In addition, the spontaneous character of composition lowers the information density of individual utterances while increasing their number. Information is often spread over several utterances and linked by pronominal anaphora.

Edited words, omissions and interruptions are disfluencies that alter utterance structure, causing a significant drop in part-of-speech tagging and parsing performance. Altering the syntactic structure of utterances also makes syntax-driven learning of extraction patterns difficult, if not impossible.

Specialized texts, for their part, are characterized by a sub-language whose specialized vocabulary refers to relevant domain concepts; handling it requires a semantic tagging process that goes beyond named entity extraction.

In this thesis, we show that standard approaches are unsuitable for specialized conversational texts. We propose a new IE approach that takes the characteristics of these texts into account. Our main claim is that a word-meaning-based approach is more suitable than a context-based approach for these texts.

The core component of our approach is a robust semantic tagger based on a statistical model that labels relevant pieces of information with concepts drawn from a domain ontology. Word labels are used to learn the predicate-argument relations relevant to the domain.

In addition to the robust semantic tagger, our five-step approach includes a linguistic segmentation stage that defines units suitable for the pattern learning process. Topic segmentation identifies topically coherent units that support anaphora resolution, which in turn helps identify further relevant relations hidden by the pronominalization of the topic. This stage precedes the pattern learning stage, which is based on Markov models that include wildcard states to handle edited words and null transitions to handle omissions.

We tested our approach on manually transcribed telephone conversations in the domain of maritime search and rescue, and succeeded in extracting individual facts with an F-score of 78.9%.
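
The role of wildcard states and null transitions can be illustrated with a minimal sketch. It is not the model from the thesis: the concept labels (VESSEL, STATUS, LOCATION, OTHER), the function name and the three probabilities are all hypothetical, chosen only to show how a left-to-right pattern can absorb an edited word or skip an omitted argument instead of failing to match.

```python
import math


def best_alignment_score(observed, pattern, p_match=0.8, p_wild=0.15, p_null=0.05):
    """Viterbi-style log-score of aligning observed concept labels to a pattern.

    All three probabilities are illustrative assumptions, not values from the thesis.
    - match:    consume one observed label that equals the next pattern state
    - wildcard: consume one observed label without advancing (edited word)
    - null:     advance past a pattern state without consuming (omission)
    """
    n, m = len(observed), len(pattern)
    neg_inf = float("-inf")
    # score[i][j]: best log-score after consuming i labels and passing j pattern states
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = score[i][j]
            if cur == neg_inf:
                continue
            if j < m:
                # Null transition: skip pattern state j.
                score[i][j + 1] = max(score[i][j + 1], cur + math.log(p_null))
            if i < n:
                # Wildcard state: absorb observed label i.
                score[i + 1][j] = max(score[i + 1][j], cur + math.log(p_wild))
                if j < m and observed[i] == pattern[j]:
                    # Regular emission: label i matches pattern state j.
                    score[i + 1][j + 1] = max(score[i + 1][j + 1], cur + math.log(p_match))
    return score[n][m]


pattern = ["VESSEL", "STATUS", "LOCATION"]
clean = best_alignment_score(["VESSEL", "STATUS", "LOCATION"], pattern)
edited = best_alignment_score(["VESSEL", "OTHER", "STATUS", "LOCATION"], pattern)
omitted = best_alignment_score(["VESSEL", "LOCATION"], pattern)
```

Under these assumed probabilities, the clean sequence scores highest, the sequence with one extra label pays a single wildcard penalty, and the sequence with a missing argument pays a single null-transition penalty, so all three still align to the same pattern.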

Keywords/Search Tags:

Information, Extraction, Conversations, Specialized conversational texts, Domain
