Research On Imbalanced Data Sampling Methods For Text Sentiment Classification

Posted on:2014-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:L D Zhao

Full Text:PDF

GTID:2268330401462545

Subject:Computer software and theory

Abstract/Summary:

With the continuing development of Internet technology, and especially the popularity of online shopping, a large number of product review texts have appeared online. Users' sentiment orientation, whether praise or criticism, can be obtained from these texts. Businesses can mine users' concerns and product selling points from positive reviews, and can identify their own shortcomings and their competitors' weaknesses from negative reviews. Because the numbers of positive and negative texts differ greatly, applying traditional classification techniques directly gives unsatisfactory results, especially when identifying the important minority-class samples. How to classify review texts accurately by sentiment has therefore become an urgent problem.

Aiming at imbalanced text data for sentiment classification, this thesis focuses on data-level sampling to bring the data into balance, and then applies conventional classification techniques to the balanced data. The main contributions are the following three points:

(1) Cluster-based under-sampling algorithm

We propose a cluster-based under-sampling algorithm, CUA. The method clusters the majority class of the training set and then randomly selects representative samples from each cluster so that the training data become balanced. Comparison with the without-cutting-samples baseline (WCS) and with random under-sampling (RS) shows that: ① Under-sampling is necessary when dealing with imbalanced data for sentiment classification. ② CUA does not substantially change the distribution of the data, so it is more stable than RS.

(2) Boundary region cutting algorithm

We propose a boundary region cutting algorithm (BRC) for text sentiment classification. The main idea is to cut some majority-class texts in the high-density boundary region so that the class boundary becomes clear. To check the validity of the proposed method, three groups of experiments are designed on six text sets. The results show that: ① Among the TFIDF, TF and Presence feature weighting schemes, Presence has the best average performance. ② For the parameters a and β in BRC, smaller parameter values give a better BRC effect; in the BRC+RS method, larger values than in BRC alone should be selected to obtain good performance. ③ Comparing BRC with BRC+RS, BRC does raise the recall of the minority class, but the recall of the majority class and the precision of the minority class are reduced to a certain extent. BRC+RS gives a larger increase in the overall evaluation measure F1. On the whole, BRC+RS performs better than BRC.

(3) Experimental schemes for imbalanced data cutting methods

For imbalanced sentiment data from text comments, we design two experimental schemes: verification and test. The test scheme is divided into two cases, a balanced test set and an imbalanced test set. Using the three under-sampling methods RS, CUA and BRC+RS, the experiments indicate that: ① Data cut with BRC+RS is more conducive to class distinction than data cut with RS or CUA. ② Comparing schemes 2.1 and 2.2, the classifier's ability to identify the minority class on imbalanced data is not inferior to its ability on balanced data. ③ Under scheme 2.2, BRC+RS is better than RS and CUA. Because BRC+RS cuts the data at the class boundary, it makes the classes easier to distinguish.
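
The cluster-based under-sampling step in contribution (1) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name the clustering algorithm, the number of clusters, or how many samples each cluster contributes. The sketch assumes plain k-means, a hypothetical parameter k, and a per-cluster quota proportional to cluster size. The function names are illustrative only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iterations=20):
    """Plain k-means (an assumed choice); returns one cluster label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_undersample(majority, n_target, k=3, seed=0):
    """Return indices of n_target majority samples, drawn cluster by cluster."""
    if not 0 < n_target <= len(majority) or k > len(majority):
        raise ValueError("need 0 < n_target <= len(majority) and k <= len(majority)")
    rng = random.Random(seed)
    labels = kmeans(majority, k, rng)
    clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
    clusters = [members for members in clusters if members]
    # Assumption: each cluster contributes in proportion to its size,
    # so the retained set roughly preserves the majority distribution.
    quotas = [len(members) * n_target // len(majority) for members in clusters]
    # Integer division can leave a few slots; give them to the largest clusters.
    leftover = n_target - sum(quotas)
    largest_first = sorted(range(len(clusters)), key=lambda i: -len(clusters[i]))
    for i in largest_first[:leftover]:
        quotas[i] += 1
    kept = []
    for members, quota in zip(clusters, quotas):
        kept.extend(rng.sample(members, quota))  # random pick within the cluster
    return sorted(kept)


# Toy majority class: three loose groups of ten 2-D points each.
majority = [[x + 0.1 * i, y + 0.1 * i]
            for (x, y) in [(0, 0), (5, 5), (10, 0)] for i in range(10)]
kept = cluster_undersample(majority, n_target=6)  # 6 = assumed minority size
balanced_majority = [majority[i] for i in kept]
```

In practice the vectors would be document feature vectors (for example Presence weights), and n_target would be set to the size of the minority class so that the two classes end up equal in size.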

Keywords/Search Tags:

Text sentiment classification, Imbalanced text set, Clustering, Under-sampling method, Boundary region

Related items

1	Text Classification Algorithm Based On Imbalanced Data Sets
2	Research On Imbalanced Data Classification Method And Its Application In Sentiment Classification Of MOOC Course Comment
3	Research On Feature Generation Methods For Text Sentiment Classification
4	Research On Imbalanced Text Classification
5	Research On Text Sentiment Clustering Method Based On Dimension Identification
6	Text Sentiment Classification Model Based On Deep Learning
7	Methods Based On Combined Deep Neural Networks For Text Sentiment Analysis
8	Research Of Text Clustering And Classification Method Based On Genetic Annealing Algorighms
9	Research On Text Sentiment Classification Method Based On Deep Learning
10	Multi-granular Text Sentiment Classification For Method Research Based On Machine Learning