| With the rapid development of the Internet and related technologies, massive and heterogeneous information has appeared on the web. Extracting and producing knowledge from these vast amounts of unstructured data, and finding content of interest, has become an urgent problem. Chinese webpage classification, one of the key technologies for solving this problem, has increasingly become a hot research topic. It is widely used in search engines, information push, information filtering, and automatic question answering. This paper introduces the key technologies of Chinese webpage classification, including preprocessing, feature extraction, and mainstream webpage classification algorithms. It describes feature extraction methods such as TF-IDF, mutual information, the chi-square statistic, information gain, and expected cross entropy, and it analyzes the basic ideas, main advantages, and disadvantages of mainstream webpage classification algorithms such as minimum distance, KNN, naive Bayes, and SVM. In webpage feature extraction, the traditional VSM ignores the dependencies between terms and their semantic characteristics. The word co-occurrence graph handles this problem better, but mainstream word co-occurrence graph methods compute the weights of feature words in a simple, mechanical way. The improved word co-occurrence method presented in this paper considers not only the semantic information of terms but also the influence of high-frequency words on the theme. Experiments show that the method is simple and achieves a higher accuracy rate. Among webpage classification algorithms, KNN has been widely used. A significant disadvantage of KNN is that its computational cost increases linearly with the size of the training set, so its time consumption becomes unacceptable when the training set is very large.
In view of this situation, this paper proposes an improved KNN algorithm. The main idea is to increase the algorithm's efficiency by improving the strategy for finding the nearest neighbors of an unclassified text. At the end of the paper, the performance of the KNN, NB, and SVM algorithms is verified by experiment, and the results of the improved KNN algorithm are listed for comparison. The improved algorithm raises the computational efficiency of classification and reduces the complexity of the algorithm. |
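
As a rough illustration of the linear cost of KNN noted above, the following sketch implements plain brute-force KNN over TF-IDF vectors with cosine similarity. It is not the improved algorithm proposed in the paper. The function names, the smoothed IDF formula, the toy documents, and the labels are all hypothetical, and the input is assumed to be already segmented into tokens (for Chinese text, by a word segmenter that is not shown here).

```python
import math
from collections import Counter


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}


def vectorize(doc, idf):
    """Sparse TF-IDF vector (term -> weight); terms unseen in training are dropped."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query_vec, train_vecs, labels, k=3):
    """Brute-force KNN: one similarity computation per training document,
    so the cost of classifying a single text grows linearly with the training set."""
    scored = [(cosine(query_vec, vec), label) for vec, label in zip(train_vecs, labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, pre-tokenized toy corpus (assumption: segmentation already done).
train_docs = [
    ["football", "match", "goal"],
    ["team", "match", "score"],
    ["stock", "market", "price"],
    ["bank", "market", "rate"],
]
labels = ["sports", "sports", "finance", "finance"]

idf = build_idf(train_docs)
train_vecs = [vectorize(doc, idf) for doc in train_docs]

predicted = knn_classify(vectorize(["goal", "team", "match"], idf), train_vecs, labels, k=3)
print(predicted)
```

Every call to `knn_classify` scans the whole training set, which is the bottleneck that a better neighbor-search strategy would aim to reduce.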