
Research on English-Chinese Cross-Language Topic Detection and Tracking Technology

Posted on: 2014-01-14 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Q Lu | Full Text: PDF
GTID: 1225330401958599 | Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The world has gradually entered the information and digital era. According to the 30th CNNIC survey report, at the end of June 2012 the number of Internet users in China had reached 538 million, of whom 392 million were online news users, so that as many as 73% of Internet users read news online. Because news can be published on it easily and quickly, the Internet is becoming the "fourth medium" of news communication. Ordinary people hope to find news that interests them among massive online sources and also to learn about news in other countries. Cross-language detection and tracking of online news has therefore gradually become a hot topic in current research.
Cross-language detection and tracking of online news faces several challenges. First, the lack of means for describing online news makes cross-language topic description more difficult. Second, cross-language topic detection and tracking must handle news reports in a multilingual environment, and how to bridge the language divide is a technical problem that remains to be solved. Third, how to develop existing technology further and apply it to topic detection and tracking merits further research.
In response to these problems, we hope that this research on English-Chinese cross-language topic detection and tracking technology can make a modest contribution to the development of related language processing technologies and provide a reference for our own multi-ethnic language text processing.
This dissertation has five parts: the analysis of cross-language news report texts, methods for building cross-language topic models, methods for corpus construction, cross-language topic detection, and cross-language topic tracking.
First, we analyze the core elements of news reports and conclude that both lexical processing and news elements can be used to distinguish different report texts.
Then, starting from the "report-topic-event" relationship, we explain the basic idea of cross-language topic detection and tracking (CLTDT) research and analyze the shortcomings of commonly used text representation models. In our view, early text representation models lack an in-depth description of the "report-topic-event" relationship. To reveal the hidden topics in news text, we selected the LSI model and the LDA model for text modeling experiments and, through experimental comparison and analysis, evaluated how well each model describes the text of news reports.
On the basis of this theoretical analysis and experimental verification, we propose an approach to cross-language topic detection and tracking built on an English-Chinese comparable corpus. Through several steps, including data collection, metadata labeling, and named entity annotation, we attempt to establish an "English-Chinese Cross-Language News Reports Comparable Corpus", which serves as the basis for our cross-language news topic detection and tracking research.
Combining current cross-language processing technology and LDA model research with the aims of this dissertation, we propose a cross-language united LDA (CLU-LDA) model. This model can not only detect events retrospectively in both English and Chinese news report data but also help us find new events.
In cross-language topic tracking, using existing prior knowledge and the comparable corpus, we can not only describe how news events develop over time but also track specific news reports effectively.
Keywords/Search Tags: Cross-language Topic Detection, Cross-language Topic Tracking, Comparable Corpus, LDA Model