| With the continuous development of Internet technology, and especially the popularity of online shopping, a large number of product review texts have appeared online. Users’ sentiment orientation toward a product, whether praise or criticism, can be obtained from these texts. Businesses can mine users’ concerns and product selling points from positive reviews, and can also find their own shortcomings and the weaknesses of competitors from negative reviews. Because the numbers of positive and negative texts differ greatly, directly applying traditional classification techniques gives unsatisfactory results, especially in identifying the very important minority class samples. Therefore, how to accurately classify review texts by sentiment has become an urgent problem. To address imbalanced text data in sentiment classification, this thesis focuses on data-level sampling to balance the data, and then applies conventional classification techniques to the balanced data. The main contributions are the following three points: (1) Cluster-based under-sampling algorithm. We propose a cluster-based under-sampling algorithm, CUA. The method clusters the majority class of the training set and then randomly selects representative samples from each cluster to balance the training data. Comparison with the no-cutting baseline (WCS, without cutting samples) and random under-sampling (RS) shows that: ① Under-sampling is necessary when dealing with imbalanced data for sentiment classification. ② CUA does not substantially change the data distribution, so it is more stable than RS. (2) Boundary region cutting algorithm. We propose a boundary region cutting algorithm (BRC) for text sentiment classification. The main idea is to cut some majority class texts in the high-density boundary region to make the class boundary clear. 
To check the validity of the proposed method, three groups of experiments are designed on six text sets. The results show that: ① Among the TFIDF, TF and Presence feature weighting schemes, Presence has the best average performance. ② We study the impact of the parameters a and β in BRC on text sentiment classification. Our experiments show that the smaller the parameter values, the better BRC performs. In the BRC+RS method, larger values than in BRC should be selected to obtain good performance. ③ Comparing BRC and BRC+RS, we find that BRC does improve the recall of the minority class; however, the recall of the majority class and the precision of the minority class are reduced to some extent. BRC+RS yields a larger increase in the overall evaluation measure F1. On the whole, BRC+RS performs better than BRC. (3) Experiment schemes for imbalanced data cutting methods. For imbalanced review sentiment data, we design two experimental schemes, verification and test. The test scheme is divided into two cases: a balanced test set and an imbalanced test set. Using the three under-sampling methods RS, CUA and BRC+RS, the experiments indicate that: ① Data cut by BRC+RS are more conducive to class distinction than data cut by RS or CUA. ② Comparing schemes 2.1 and 2.2, the classifier’s ability to identify the minority class on imbalanced data is not inferior to that on balanced data. ③ Under scheme 2.2, BRC+RS is better than RS and CUA. Because BRC+RS cuts data at the class boundary, it makes the classes easier to distinguish. |
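
The cluster-based under-sampling idea described in contribution (1) can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm CUA uses, how many clusters it forms, or how many samples it draws per cluster, so everything concrete here is an assumption: plain k-means as the clustering step, a per-cluster quota proportional to cluster size, and the hypothetical names `kmeans` and `cluster_undersample`. Documents are assumed to be already encoded as dense feature vectors (for example, Presence weights).

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step).

    Returns one cluster index per point.
    """
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = _mean(members)
    return labels


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Pick n_keep majority-class indices, spread across clusters.

    Assumption: each cluster contributes in proportion to its size, which
    keeps the retained sample close to the original distribution.
    """
    total = len(majority)
    if n_keep >= total:
        return list(range(total))
    rng = random.Random(seed)
    k = min(k, total)
    labels = kmeans(majority, k, rng)
    kept = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        quota = n_keep * len(members) // total  # floor of proportional share
        kept.extend(rng.sample(members, quota))
    # Flooring can leave a few slots open; fill them at random.
    chosen = set(kept)
    leftovers = [i for i in range(total) if i not in chosen]
    kept.extend(rng.sample(leftovers, n_keep - len(kept)))
    return sorted(kept)


if __name__ == "__main__":
    # Toy data: 12 majority vectors, to be cut down to 4 (the minority size).
    data_rng = random.Random(42)
    majority = [[data_rng.random(), data_rng.random()] for _ in range(12)]
    print(cluster_undersample(majority, n_keep=4, k=3))
```

Setting `n_keep` to the minority class size gives a balanced training set. Random under-sampling (RS) corresponds to skipping the clustering step and drawing `n_keep` indices directly, which is why RS results can vary more from one draw to the next.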