| With the development of the Internet, a large amount of comment text has accumulated online. When review datasets are collected in practice, the number of texts often differs greatly between classes; that is, the data distribution is unbalanced. When such text is classified with a traditional classifier, the results are unsatisfactory, especially for the minority-class samples, which are very important. Motivated by the significance of unbalanced text data mining, this thesis surveys and analyzes the state of research at home and abroad, introduces the principles of resampling-based algorithms and the related text preprocessing techniques, and designs an unbalanced text classification system. The specific contents are as follows. (1) Text preprocessing. Raw text data cannot be used directly to build a classifier. The system therefore provides preprocessing methods for Chinese and English datasets, including Chinese word segmentation, stop-word removal, and text vectorization. (2) Implementation of the resampling methods. This thesis introduces and implements six typical unbalanced classification techniques based on resampling and discusses the advantages and disadvantages of each. A comparative experiment is then carried out, and the experimental procedure and results are reported. (3) Implementation of the text classification system. An unbalanced text classification system is built in Java, with mixed programming in Matlab and Python. Its main functions include selection of the text input path, Chinese text segmentation, feature selection, vector representation of text, cross-validation, selection of the resampling method, display of evaluation metrics, and setting of the output path for results. To serve both practical applications and research experiments, the system is designed so that text preprocessing and the unbalanced-data handling methods are optional modules, and users can select functions according to their needs when classifying unbalanced text data. The system provides an auxiliary tool for researchers studying unbalanced text classification and can also be used by ordinary users, so it has practical application value. |
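
To make the idea of resampling concrete, the following is a minimal sketch of random oversampling, one common resampling technique. The abstract does not name the six methods the thesis implements, so this is an illustrative assumption and not necessarily one of them; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen samples of the smaller
    classes until every class is as large as the largest class."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest one.
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy review data (assumed for illustration): four positive, one negative.
texts = ["good", "great", "fine", "nice", "awful"]
labels = ["pos", "pos", "pos", "pos", "neg"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

When resampling is combined with cross-validation, it is usually applied to the training folds only, so that duplicated samples do not leak into the evaluation data.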