Research On Classification Algorithm For Chinese Webpage

Posted on:2014-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:Q Qian

GTID:2268330422467381

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

With the rapid development of the Internet and related technologies, massive amounts of heterogeneous information have appeared on the web. How to extract knowledge from this vast body of unstructured data and find content of interest has become an urgent problem. Chinese webpage classification, one of the key technologies for solving it, has become an increasingly active research topic and is widely used in search engines, information push, information filtering, and automatic question answering.

This paper introduces the key technologies of Chinese webpage classification, including preprocessing, feature extraction, and mainstream webpage classification algorithms. It describes feature methods such as TF-IDF, mutual information, the chi-square statistic, information gain, and expected cross entropy, and analyzes the basic ideas, main advantages, and disadvantages of mainstream webpage classification algorithms such as minimum distance, KNN, naive Bayes, and SVM.

In webpage feature extraction, the traditional vector space model (VSM) ignores the dependencies between terms and their semantic characteristics. A word co-occurrence graph handles this problem better, but mainstream word co-occurrence graph methods compute the weights of feature words in a simple, mechanical way. The improved word co-occurrence method presented in this paper considers not only the semantic information of terms but also the influence of high-frequency words on the theme. Experiments show that the method is simple and achieves a higher accuracy rate.

Among webpage classification algorithms, KNN has been widely used. A significant disadvantage of KNN is that its computational cost increases linearly with the size of the training set, so classification time becomes unacceptable when the training set is very large. To address this, this paper proposes an improved KNN algorithm. The main idea is to increase efficiency by improving the strategy for finding the nearest neighbors of an unclassified text.

Finally, the performance of the KNN, NB, and SVM algorithms is verified by experiment, and the results of the improved KNN algorithm are listed for comparison. The improved algorithm raises the computational efficiency of classification and reduces the complexity of the algorithm.
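
To make the linear cost of KNN concrete, the following is a minimal sketch of a plain brute-force KNN classifier over TF-IDF vectors with cosine similarity. It is not the improved algorithm proposed in the thesis, whose details are not given in this abstract. The function names, the toy corpus, the category labels, and the smoothed IDF formula are all illustrative assumptions, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for pre-segmented documents."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so no term gets weight zero.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query_tokens, train_vectors, train_labels, idf, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    tf = Counter(query_tokens)
    # Terms never seen in training carry no weight and are dropped.
    query = {t: (c / len(query_tokens)) * idf[t] for t, c in tf.items() if t in idf}
    # Brute force: one similarity per training document, so the cost of
    # classifying a single page grows linearly with the training set size.
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in zip(train_vectors, train_labels)),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, already segmented into words.
train_docs = [
    ["足球", "比赛", "球队"],
    ["篮球", "比赛", "球员"],
    ["股票", "市场", "投资"],
    ["银行", "投资", "利率"],
]
train_labels = ["sports", "sports", "finance", "finance"]

train_vectors, idf = tfidf_vectors(train_docs)
print(knn_classify(["球队", "比赛"], train_vectors, train_labels, idf))
```

Every call to `knn_classify` scans the whole training set, which is the bottleneck the abstract describes. Any strategy that narrows the neighbor search to a subset of the training documents would replace the full scan inside that function.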

Keywords/Search Tags:

Chinese Webpage Classification, Vector Space Model (VSM), Word Co-occurrence Graph, KNN

Related items

1	Research On Chinese Webpages Classification Based On K-nearest Neighbour Algorithm And Relative Hyperlinks
2	Research And Implementation Of Automatic Classification System And Key Technologies On Chinese Web Page
3	Improved Vector Space Model And Its Application To Document Classification System
4	Research On Duplicate Removal And Similarity Evaluation Of Chinese Agricultural Web Pages
5	Research Of Chinese Page Automatic Classification Based On Vector Space Model
6	Automatic Classification Research On Chinese Web Document Orientation
7	Research Of Chinese Text Classification Based On Improved Vector Space Model
8	Research And Application Of Chinese Web Pages Automatic Classification
9	A Research On Large Scale Automatic Chinese Webpages Classification
10	Research And Implementation Of Web Page Classification Based On CNN And SVM