Content-based Subject Categorization In Chinese News On Epidemiology

Posted on:2012-03-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y R Dai

GTID:2178330335460777

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

Subject categorization is a branch of text categorization in data mining. Taking into account the characteristics of news and the specific requirements of data mining, this thesis proposes a new approach to subject categorization for Chinese news on epidemiology. The thesis first summarizes the important subjects in epidemiology as a subject list and builds a professional dictionary on epidemiology drawing on domain knowledge. It then builds a Chinese news corpus on epidemiology of 300 news items, collected by full-text search for the keyword '流行病' (epidemic disease) on the Baidu search engine. The subjects of each news item in the corpus are annotated manually. Next, the thesis presents a method for collecting useful news from websites through RSS feeds and describes how to extract the text from an HTML page. The main part is subject categorization, whose aim is to derive the subjects from the content of news on epidemiology. Here the thesis combines a method based on the professional dictionary with a method based on TextTiling. The first focuses on keywords in the news content. The second is usually applied to divide news reports into passages. The TextTiling-based method is used only when the dictionary-based method classifies the content as null; in that case, the result is revised using the output of the TextTiling-based method. This addresses errors that arise because many words are not registered in the professional dictionary. The traditional algorithm has three main parts for discovering sub-subjects. Because it is used here for subject categorization, the thesis adds a fourth step, called subject location, to find the subjects. Some details also differ from Hearst's experiment. For example, the weight of a token takes into account its frequency, its distribution, and whether it appears in the title. The experimental system shows that combining the dictionary-based method with the TextTiling-based method gives better results than using either method alone. Based on the experiment, the thesis evaluates the performance of the model and reaches some useful conclusions about implementation. Finally, it proposes a multilingual information retrieval system on epidemiology as a future application. Such a system could explore the content of news on epidemiology to track the development of an epidemic disease and to group together news items describing the same subject.
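
The dictionary-first, fallback-second flow and the token weighting described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation. The dictionary entries, the weighting formula (frequency times paragraph spread times a title boost), the constant `TITLE_BOOST`, and the function names are all assumptions made for the example. The TextTiling step is not implemented; it is represented by a caller-supplied `fallback` function. English sample text and a whitespace tokenizer are used only to keep the sketch self-contained.

```python
from collections import Counter

# Hypothetical professional dictionary: keyword -> subject label.
DICTIONARY = {
    "vaccine": "prevention",
    "quarantine": "prevention",
    "outbreak": "outbreak report",
    "cases": "outbreak report",
}

TITLE_BOOST = 2.0  # assumed multiplier for tokens that also occur in the title


def tokenize(text):
    """Lowercase whitespace tokenizer; real Chinese text needs a word segmenter."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t]


def token_weight(token, title, paragraphs):
    """Assumed formula: frequency * share of paragraphs containing the token * title boost."""
    para_tokens = [tokenize(p) for p in paragraphs]
    frequency = sum(tokens.count(token) for tokens in para_tokens)
    spread = sum(1 for tokens in para_tokens if token in tokens) / max(len(paragraphs), 1)
    boost = TITLE_BOOST if token in tokenize(title) else 1.0
    return frequency * spread * boost


def dictionary_subject(title, paragraphs):
    """Return the best-scoring subject, or None when no dictionary keyword occurs."""
    scores = Counter()
    for keyword, subject in DICTIONARY.items():
        scores[subject] += token_weight(keyword, title, paragraphs)
    best = scores.most_common(1)
    if best and best[0][1] > 0:
        return best[0][0]
    return None


def categorize(title, paragraphs, fallback):
    """Use the dictionary first; call the fallback only when the result is None."""
    subject = dictionary_subject(title, paragraphs)
    return subject if subject is not None else fallback(title, paragraphs)
```

In this sketch, `fallback` stands in for the TextTiling-based method and is called only when the dictionary finds no registered keyword, mirroring the "classified to null" case in the abstract.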

Keywords/Search Tags:

news on epidemiology, subject categorization, TextTiling, the professional dictionary

Related items

1	Research And Application On Techniques Of Lucene-Based Subject-Oriented Search System
2	Research On The Professional Identity Of Contemporary Newspapers In The Change Of Media Environment
3	Research On Text Categorization Algorithm For Science And Technology Text Based On Subject Conceptual Tree
4	Intelligent Microblog Information Generation Strategy Based On Subject Crawler And Text Categorization
5	AdaTextTiling: A New Adaptive Method To Text Segment Base On TextTiling
6	Science And Technology News In Modern Period Of China
7	The Investigation And Analyse On The Current Situation Of The Education Of News Professional Morality In The Universities Of The North East
8	Research On The Construction Of Professional Ability Of News Presenters Under The Background Of New Media
9	Inner Mongolia Television Professional News Channel Operation Pattern Design
10	Research On The Mode Of Integration Of Professional Finance And Sub - Network