| Subject categorization is a branch of text categorization in data mining. Based on the characteristics of news and the particular requirements of data mining, this thesis proposes a new approach to subject categorization for Chinese news on epidemiology. We first summarize the important subjects in epidemiology as a subject list and, drawing on domain knowledge, compile a professional dictionary of epidemiology. We then build a Chinese news corpus on epidemiology of 300 news items, collected through full-text search for the keyword '流行病' (epidemic disease) on the Baidu search engine. The subjects of every news item in the corpus are annotated manually. Next, we present a method for retrieving useful news from websites through RSS feeds, and describe how to extract the text from an HTML page. The main part of the thesis is subject categorization, whose aim is to identify the subjects of the content of epidemiology news. Here we combine a method based on the professional dictionary with one based on TextTiling. The first focuses on keywords in the news content. The second is usually applied to divide news reports into paragraphs. The TextTiling-based method is applied only when the dictionary-based method classifies the content as null; in that case, the result is revised using the output of the TextTiling-based method. This corrects some of the errors that arise because many words are not registered in the professional dictionary. The traditional algorithm has three main parts for discovering sub-subjects. Because it is used here for subject categorization, we add a fourth step, called subject location, to find the subjects. Some details also differ from Hearst's experiment.
For example, when setting the weight of a token, we consider the frequency and distribution of the token and whether it appears in the title. The experimental system shows that combining the dictionary-based method with the TextTiling-based method gives better results than either method alone. Based on the experiments, the thesis evaluates the performance of the model and draws some useful conclusions about implementation. Finally, we propose a multilingual information retrieval system on epidemiology as a future application. Such a system could explore the content of epidemiology news to track the development of an epidemic disease and group together news describing the same subject. |
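
The two-stage decision described in the abstract (dictionary first, TextTiling-based method only when the dictionary yields null) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the toy `SUBJECT_DICTIONARY`, the function names, and the keyword-counting rule are all hypothetical, and the TextTiling-based step is passed in as a placeholder callable because the abstract does not specify its details.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical toy dictionary mapping subject -> keywords. The thesis's
# actual professional dictionary is not reproduced here.
SUBJECT_DICTIONARY: Dict[str, List[str]] = {
    "transmission": ["传播", "感染"],
    "prevention": ["疫苗", "预防"],
}


def categorize_by_dictionary(
    text: str, dictionary: Dict[str, List[str]]
) -> Optional[str]:
    """Return the subject with the most keyword hits, or None (the null class)."""
    counts: Counter = Counter()
    for subject, keywords in dictionary.items():
        # Substring counting is an assumption; it avoids word segmentation.
        hits = sum(text.count(keyword) for keyword in keywords)
        if hits:
            counts[subject] = hits
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def categorize(
    text: str,
    dictionary: Dict[str, List[str]],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Use the dictionary result; consult the fallback only when it is null."""
    subject = categorize_by_dictionary(text, dictionary)
    if subject is None:
        # Stand-in for the TextTiling-based method, supplied by the caller.
        subject = fallback(text)
    return subject
```

For instance, `categorize("疫苗接种是预防流感的有效手段", SUBJECT_DICTIONARY, fallback)` is resolved by the dictionary alone, while a text containing none of the dictionary keywords is handed to `fallback`. Passing the second stage as a callable keeps the sketch independent of any particular TextTiling implementation.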