
Research Of Chinese Web Page Classification Based On The Algorithm Of Feature Selection And Weights Calculation

Posted on: 2011-10-18 | Degree: Master | Type: Thesis
Country: China | Candidate: L C Kong | Full Text: PDF
GTID: 2178360305472997 | Subject: Computer software and theory
Abstract/Summary:
In modern society, the Internet has dramatically changed our lives. Given the huge amount of information on the Internet, how to find the information we actually want has become an important problem, and web page classification has become a popular research area. Web page categorization is the process of using computers to automatically classify large numbers of web pages according to given categorization rules. It organizes web pages in an orderly way, improves the performance of information retrieval systems, and increases the availability of web resources. Feature selection and weight calculation are key steps in web page categorization and prerequisites for improving its efficiency; the algorithms chosen for them directly affect classifier performance.

In building a Chinese web page classification system, we studied the main stages of web page classification in depth, including Chinese web page information extraction, Chinese word segmentation, feature extraction, weight calculation, and web page classification. The author also proposes improved algorithms based on the traditional feature extraction and weight calculation algorithms. The main work of the thesis is as follows.

First, the thesis reviews the current state of research on web page categorization in China and abroad, together with the research methods in use, and points out the key points and difficulties of the research.

Second, we examine in depth how the traditional mutual information (MI) algorithm and the traditional tf-idf formula are applied to page classification and where they fall short. The traditional MI algorithm ignores features whose MI is negative and is biased toward words with low occurrence probabilities, while the traditional tf-idf formula ignores how features are distributed among categories. We improve both algorithms to address these defects, and experiments verify the superiority and feasibility of the improvements.

Finally, the thesis uses supervised machine learning to implement a web page classifier. The method proceeds as follows: text segmentation, feature extraction using the improved MI, weight calculation using the improved tf-idf formula, and construction of a KNN classifier. Across many experiments, the improved algorithms outperformed the traditional ones and achieved higher precision.

With the rapid growth of Internet information, network data mining has increasingly become a major field of academic research. As an important branch of network data mining, Chinese web page classification has great research value and practical significance.
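For orientation, the traditional baselines named in the abstract can be sketched roughly as follows. This is a minimal illustration only: it shows the textbook pointwise MI score, the textbook tf-idf weight, and a plain cosine-similarity KNN vote. It does not reproduce the thesis's improved algorithms, which the abstract does not specify. The toy corpus, the function names, and the choice of k = 3 are all assumptions made for the example, and the documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter

# Hypothetical, already-segmented toy corpus: (tokens, category).
DOCS = [
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
    (["football", "match", "goal"], "sports"),
    (["match", "team", "goal", "market"], "sports"),
]

def pointwise_mi(term, category, docs):
    """Traditional MI(t, c) = log(P(t, c) / (P(t) * P(c))), from document counts."""
    n = len(docs)
    n_t = sum(1 for toks, _ in docs if term in toks)
    n_c = sum(1 for _, c in docs if c == category)
    n_tc = sum(1 for toks, c in docs if c == category and term in toks)
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(n_tc * n / (n_t * n_c))

def tfidf_vector(tokens, docs):
    """Traditional tf-idf: raw term frequency times log(N / df)."""
    n = len(docs)
    vec = {}
    for term, freq in Counter(tokens).items():
        df = sum(1 for toks, _ in docs if term in toks)
        if df:  # skip terms unseen in the training corpus
            vec[term] = freq * math.log(n / df)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(tokens, docs, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = tfidf_vector(tokens, docs)
    scored = sorted(
        ((cosine(query, tfidf_vector(toks, docs)), cat) for toks, cat in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(cat for _, cat in scored[:k])
    return votes.most_common(1)[0][0]
```

On this toy corpus, `pointwise_mi("market", "sports", DOCS)` is negative, and "stock" (one document) scores as high for "finance" as "bank" (two documents). These are small instances of the two MI defects the abstract describes. The tf-idf weight here depends only on corpus-wide document frequency, so it carries no information about how a term is spread across categories.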
Keywords/Search Tags: Chinese web page classification, feature extraction, weight calculation