Research And Application On The Technology Of Web Text Mining

Posted on:2016-09-07

Degree:Master

Type:Thesis

Country:China

Candidate:X D Li

Full Text:PDF

GTID:2298330470455581

Subject:Electronic and communication engineering

Abstract/Summary:

With the development of the Internet, large numbers of new websites appear every day, and together they produce an enormous volume of Web pages. The information in these pages is broad in scope and varied in content. Obtaining the right information accurately from such a large body of pages is therefore key to studying and working efficiently, and Web text categorization plays an extremely important role in this. Following the Chinese text classification process, the paper describes and summarizes five aspects: page acquisition, Chinese word segmentation, feature extraction, and the improvement and implementation of the classification algorithm. The main work of the paper is as follows.

(1) The page acquisition process is described, and common word segmentation and feature extraction algorithms are briefly introduced. The classification algorithms commonly used in Web text mining are summarized, and the advantages and disadvantages of each in practice are analyzed.

(2) Taking both classification accuracy and efficiency into account, the paper selects the naive Bayes algorithm and improves it. The principles and shortcomings of naive Bayes are analyzed: the algorithm rests on the assumption that attributes are independent, but this assumption does not hold in real data, so the algorithm needs improvement. While retaining the independence assumption, the improved algorithm takes into account how frequently each feature term appears in the whole data set and applies a weighting factor to the conditional probability of each feature. As a result, accuracy improves without increasing the amount of computation, and the improved algorithm also achieves better recall.

(3) The feasibility of parallelizing the naive Bayes classifier is analyzed. Using MapReduce, a common framework for large-scale data processing, the paper gives a detailed design and implementation of the naive Bayes classifier. The experimental environment is built in pseudo-distributed mode, and the design is implemented there. Processing speed improves significantly when the algorithm handles massive text data.

(4) Finally, experiments on the improved classification algorithm are carried out in Java, and its classification performance is evaluated against the experimental results; the desired effect is achieved.

In short, the paper analyzes the various aspects of Web text classification, focusing on researching and improving the naive Bayes classifier and implementing it in a big data environment. Finally, the paper compares the results of the improved naive Bayes classification algorithm with those of the original naive Bayes algorithm. The results show that the improved algorithm performs better...
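
The weighting idea in item (2) and the MapReduce-style counting in item (3) can be illustrated with a minimal single-process sketch. The abstract does not give the thesis's actual weighting formula, so the weight below (a smoothed inverse document frequency applied as a multiplier on each feature's log conditional probability) is purely an assumption for illustration. The names `map_doc`, `reduce_counts`, and `WeightedNaiveBayes` are hypothetical. The thesis reports a Java implementation on MapReduce, while this sketch uses plain Python and only imitates the map and reduce steps in memory.

```python
import math
from collections import Counter


def map_doc(label, tokens):
    """Map step: emit ((label, word), 1) for every token of one document."""
    return [((label, word), 1) for word in tokens]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (label, word) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


class WeightedNaiveBayes:
    """Multinomial naive Bayes with a per-feature weight (assumed form)."""

    def fit(self, docs):
        # docs is a list of (label, tokens) pairs.
        pairs = [p for label, tokens in docs for p in map_doc(label, tokens)]
        self.word_counts = reduce_counts(pairs)
        self.class_docs = Counter(label for label, _ in docs)
        self.class_tokens = Counter()
        for (label, _), n in self.word_counts.items():
            self.class_tokens[label] += n
        self.vocab = {word for _, word in self.word_counts}
        self.n_docs = len(docs)
        # Assumption: the weight is a smoothed IDF computed over the whole
        # data set; the thesis's real weighting factor may differ.
        doc_freq = Counter(w for _, tokens in docs for w in set(tokens))
        self.weight = {
            w: math.log((1 + self.n_docs) / (1 + doc_freq[w])) + 1.0
            for w in self.vocab
        }
        return self

    def log_scores(self, tokens):
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_docs.items():
            score = math.log(n / self.n_docs)  # log prior
            for w in tokens:
                if w not in self.vocab:
                    continue  # ignore words never seen in training
                # Laplace-smoothed conditional probability P(w | label).
                p = (self.word_counts[(label, w)] + 1) / (
                    self.class_tokens[label] + vocab_size
                )
                # The weight scales the feature's log-probability.
                score += self.weight[w] * math.log(p)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data, invented for this example only.
docs = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "win", "match"]),
    ("tech", ["code", "bug", "software"]),
    ("tech", ["software", "release", "code"]),
]
model = WeightedNaiveBayes().fit(docs)
prediction = model.predict(["team", "goal"])
```

Because the weights are looked up from a table built once during training, scoring a document costs the same number of operations as unweighted naive Bayes, which is consistent with the abstract's claim that the amount of computation does not increase. In a real MapReduce job the map and reduce functions would run on separate workers over partitions of the corpus rather than in one list comprehension.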

Keywords/Search Tags:

Web Text Mining, Classification, Improved Naive Bayes Classifier, MapReduce, Big Data

Related items

1	Data Mining Systems And Their Applications - Improve The Performance Of The Naive Bayes Text Classifier, Associated Characteristics
2	Research On Text Mining Based On MapReduce
3	Research And Application On Naive Bayes Classification Algorithm
4	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
5	A Text Classifier About High Blood Pressure Based On Naive Bayes
6	Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier
7	Naive Bayes Classification And Application Based On Improved K-means Algorithm
8	Improvement Of Navies Bayes Text Classification Algorithm Based On Unbalanced Dataset
9	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm
10	Research On Improved Naive Bayes Classification Model For Imbalanced E-commerce Review Text