| In an era of information explosion, information takes many forms, and text increasingly pervades daily life. Classifying massive text data by manual annotation alone is tedious and time-consuming, so tools are needed to organize the data and help find the truly valuable information quickly and efficiently. Text classification is the research area built around this practical problem. This paper studies how to classify text with machine learning algorithms: given large-scale text data, features are extracted and mined from the content, and the texts are assigned to different categories. The focus is on optimizing the feature selection method used in text classification. The main work is as follows: 1. The traditional TF-IDF feature selection method ignores the correlation between feature words and categories, and does not consider how feature words are distributed between or within categories. In this paper, the chi-square value is added to the TF-IDF formula to measure the correlation between feature words and text categories, and parameters for inter-category concentration and intra-category dispersion are added to compensate for traditional TF-IDF's neglect of how feature words are distributed across categories. The result is an improved feature selection method named C-TFIDF. 2. With traditional feature selection methods, the feature dimension must be set manually, which may trap the selected features in a local optimum so that they cannot accurately reflect the text information. To avoid this and to reduce the feature dimension, this paper uses a genetic algorithm to further optimize the feature subset. The feature set obtained by C-TFIDF serves as the starting point of the genetic algorithm's search, and text classification accuracy serves as the fitness function, so that the search finds features more representative of each text category. To some extent, this improves feature quality and thereby classification accuracy. In the experiments, a list of commonly used stop words is collected and organized, and dynamic stop words are added on top of it to complete text preprocessing. Classifiers are trained and tested with naive Bayes, SVM and neural network algorithms, and the classification results are evaluated with precision, recall and F-measure. The experiments show that both the improved C-TFIDF feature selection method and the genetic-algorithm-based feature selection method yield more accurate classification results than the traditional method. |
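
The idea behind the first contribution, weighting TF-IDF by a chi-square term so that category-discriminative words score higher, can be illustrated with a minimal sketch. This is not the paper's C-TFIDF formula, which the abstract does not state: the plain product `tf * idf * chi2`, the use of the maximum chi-square over categories, the function names, and the toy corpus are all assumptions made for illustration, and the inter-category concentration and intra-category dispersion parameters are omitted.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, category):
    """Chi-square statistic of a term/category pair from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def chi_weighted_tfidf(docs, labels):
    """Per-document term weights: tf * idf * max-over-categories chi-square.

    Assumption: the chi-square factor is simply multiplied in; the paper's
    actual combination may differ.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in docs for t in set(tokens))
    categories = set(labels)
    chi_max = {
        t: max(chi_square(docs, labels, t, c) for c in categories) for t in df
    }
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            t: (count / len(tokens)) * math.log(n_docs / df[t]) * chi_max[t]
            for t, count in tf.items()
        })
    return weights


# Hypothetical toy corpus: tokenized documents and their category labels.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

weights = chi_weighted_tfidf(docs, labels)
# Rank the terms of the first document by weight, highest first.
ranked = sorted(weights[0], key=weights[0].get, reverse=True)
print(ranked)
```

In this toy corpus, "team" appears in both categories, so its chi-square factor is small and it ranks below "goal", which occurs only in sport documents. Taking the maximum over categories is one common way to reduce per-category chi-square scores to a single value per term; an average weighted by category frequency would be an equally plausible choice.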