With the development of the Internet, a large number of new websites appear every day, and these websites produce a huge number of Web pages. The information contained in these pages is broad in scope and varied in content. How to accurately extract information from this mass of Web pages is therefore key to improving the efficiency of study and work, and Web text categorization plays an extremely important role here. Following the Chinese text classification process, this paper discusses and summarizes five aspects: page acquisition, Chinese word segmentation, feature extraction, and the improvement and implementation of the classification algorithm. The main work of the paper is as follows. (1) The process of obtaining pages is described, and common word segmentation and feature extraction algorithms are briefly introduced. In addition, classification algorithms commonly used in Web text mining are summarized, and the advantages and disadvantages of each in application are analyzed. (2) Taking both classification accuracy and efficiency into account, the paper selects the naive Bayes algorithm and improves it. The principles and shortcomings of the naive Bayes algorithm are analyzed: the algorithm is built on the assumption that attributes are independent, but this assumption does not hold in reality, so the algorithm should be improved. While retaining the independence assumption, the improved algorithm considers the frequency with which each feature term appears in the whole data set, adding a weighting factor that acts on the conditional probability of the feature. As a result, accuracy improves while the amount of computation does not increase.
In addition, the improved algorithm also has an advantage in recall. (3) The feasibility of parallelizing the naive Bayes classifier is analyzed. Based on MapReduce, a common framework for large-scale data processing, a detailed design and implementation of the naive Bayes classifier is presented. The experimental environment is built in pseudo-distributed mode, and the design is implemented in it. As a result, processing speed improves significantly when the algorithm handles massive text data. (4) Finally, experiments on the improved classification algorithm are carried out in Java. The classification performance of the improved algorithm is evaluated on the experimental results, and the desired effect is achieved. In short, the paper analyzes the various aspects of Web text classification, focuses on studying and improving the naive Bayes classifier, and implements it in a big data environment. Finally, the paper compares the results of the improved naive Bayes classification algorithm with those of the original naive Bayes classification algorithm. The results show that the improved algorithm performs better.
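
The idea in item (2), a weighting factor applied to the conditional probability of each feature, can be illustrated with a minimal Java sketch. The abstract does not give the weighting formula, so the sketch assumes a hypothetical IDF-style weight computed from how many documents in the whole data set contain a term; the class name `WeightedNaiveBayesSketch`, its methods, and the toy training data are likewise illustrative assumptions, not the paper's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: multinomial naive Bayes with a per-term weight.
class WeightedNaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Map<String, Integer> docFreq = new HashMap<>(); // documents containing each term
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Add one already-segmented training document with its class label.
    public void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
        for (String t : new HashSet<>(tokens)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    // ASSUMED weighting factor (not from the paper): terms that occur in
    // fewer documents of the whole data set receive a larger weight.
    public double weight(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return Math.log(1.0 + (double) totalDocs / (1.0 + df));
    }

    // Return the class maximizing log P(c) + sum_t weight(t) * log P(t | c).
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int total = tokensPerClass.getOrDefault(label, 0);
            for (String t : tokens) {
                // Laplace-smoothed conditional probability of the term given the class.
                double p = (counts.getOrDefault(t, 0) + 1.0) / (total + vocabulary.size());
                score += weight(t) * Math.log(p);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        WeightedNaiveBayesSketch nb = new WeightedNaiveBayesSketch();
        nb.train("sports", Arrays.asList("football", "match", "goal"));
        nb.train("sports", Arrays.asList("team", "match", "win"));
        nb.train("finance", Arrays.asList("stock", "market", "price"));
        nb.train("finance", Arrays.asList("bank", "market", "loan"));
        System.out.println(nb.classify(Arrays.asList("goal", "team", "match")));
        System.out.println(nb.classify(Arrays.asList("loan", "price")));
    }
}
```

The weight multiplies the log of each conditional probability, so classification still takes a single pass over the document's terms once the document frequencies have been collected during training. In a MapReduce setting, the per-class term counts and the document frequencies are the kind of statistics that could be gathered by counting jobs, although the paper's actual job design is not described in the abstract.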