
Research On Imbalanced News Text Mining Based On Improved Random Forest

Posted on: 2021-04-08    Degree: Master    Type: Thesis
Country: China    Candidate: Z C Zhai    Full Text: PDF
GTID: 2517306092496744    Subject: Statistics
Abstract/Summary:
With the rapid development of information technology, people have grown accustomed to reading all kinds of information on online platforms, which has driven the growth of news websites. Because most news data is unstructured, ordinary computing techniques cannot process it directly, and text classification has become an important technology for handling such text data. Random Forest is an ensemble learning method that combines multiple decision trees; it offers high classification accuracy, small generalization error, and good scalability to large data sets. Owing to these advantages, Random Forest has been applied widely across disciplines and has attracted attention from scholars in many fields.

News data represented through text feature extraction is high-dimensional, and Random Forest is well suited to processing this kind of text data. However, when the sample size of some news categories is very small, Random Forest shows two weaknesses on imbalanced data. First, the classification results are biased toward the majority categories, so the classification accuracy on minority categories drops significantly. Second, although the individual decision trees in a Random Forest perform differently, all trees carry the same voting weight. In view of these problems, and using document similarity based on Word2vec, this paper improves the Random Forest in two respects: the construction of a balanced sample space and the weighted voting of decision trees.

(1) To construct a balanced sample space, this paper combines the SMOTE method with Word2vec-based document similarity: qualified samples of the minority categories and their neighboring samples are located in different regions, and new samples are synthesized by linear interpolation.

(2) Within the Random Forest built on the balanced sample space, the out-of-bag (OOB) set of each decision tree is transformed according to the document similarity, the new OOB set is classified by that tree, and a voting weight is then assigned to each 
decision tree according to its classification accuracy.

(3) To improve the predictive ability of the model, the improved Random Forest proposed in this paper is further combined with the Bagging ensemble learning method: several base learners are trained in parallel, their results are aggregated by equal-weight voting, and a strong classifier is finally formed.

The experimental results show that, on imbalanced news data, the improved Random Forest achieves better classification performance than the ordinary Random Forest algorithm. Compared with a single classifier, the ensemble of improved Random Forests also has better generalization ability.
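The two improvement steps above can be sketched in code. This is a minimal illustration, not the thesis's exact implementation: the function names are hypothetical, plain cosine similarity between pre-computed document vectors stands in for the Word2vec-based document similarity, and the neighbor-selection and OOB-transformation details of the thesis are simplified away.

```python
import numpy as np

def similarity_smote(minority_vecs, k=5, n_new=50, rng=None):
    """Step (1): synthesize minority-class samples by linear interpolation
    between a document vector and one of its k most similar neighbors.
    Cosine similarity stands in for the Word2vec document similarity."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority_vecs, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T                          # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)               # a document is not its own neighbor
    neighbors = np.argsort(sim, axis=1)[:, -k:]  # indices of the k most similar docs
    synthetic = np.empty((n_new, X.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X))                 # pick a minority sample
        j = rng.choice(neighbors[i])             # and one of its similar neighbors
        lam = rng.random()                       # interpolation weight in [0, 1)
        synthetic[t] = X[i] + lam * (X[j] - X[i])
    return synthetic

def oob_weighted_vote(tree_preds, oob_accuracies):
    """Step (2): combine per-tree predictions with weights proportional to
    each tree's accuracy on its (similarity-transformed) out-of-bag set.

    tree_preds:     (n_trees, n_samples) array of integer class labels
    oob_accuracies: (n_trees,) array of OOB accuracies in [0, 1]
    Returns the weighted-majority class label for each sample."""
    preds = np.asarray(tree_preds)
    w = np.asarray(oob_accuracies, dtype=float)
    n_samples = preds.shape[1]
    scores = np.zeros((n_samples, preds.max() + 1))
    for t in range(preds.shape[0]):
        # each tree adds its weight to the class it predicted per sample
        scores[np.arange(n_samples), preds[t]] += w[t]
    return scores.argmax(axis=1)
```

Because each synthetic sample is a convex combination of two real minority samples, it stays inside the region spanned by the minority class; the weighted vote then lets trees that classify the transformed OOB set more accurately count for more in the final decision.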
Keywords/Search Tags:News Text Classification, Unbalanced Data Set, Random Forest, Document Similarity, Ensemble Learning