| With the advent of the Internet age, information and data exist in many forms, such as images, videos, sounds, and texts. Compared with the other forms, text is more widely used because it uploads and downloads faster and consumes fewer network resources. Massive text databases store a great deal of important information, and automatic text classification technology emerged to retrieve this information quickly and accurately. Text classification has been widely applied across industries in recent years, and improving the accuracy of classification results has been the main goal of researchers in this field. Feature selection plays an important role in text classification: it eliminates irrelevant features, reduces dimensionality, and improves classification accuracy, and it is a foundation of text classification research. The performance of the feature selection algorithm therefore directly affects how the feature space of a text classification system is formed, and in turn the classification effect and accuracy. This paper studies the CHI feature selection algorithm and makes the following contributions. First, it reviews the theory of text classification: it analyzes the definition, theoretical basis, overall framework, and common algorithms of text classification; introduces several feature selection algorithms and analyzes their respective advantages and disadvantages; summarizes ideas for optimizing feature selection algorithms; and systematically surveys Chinese and English text classification. Second, it proposes improvement ideas and improved algorithms. This paper identifies the introduction of new parameters as the direction for improving the CHI algorithm and accordingly proposes the Var-CV-CHI feature selection algorithm, which is based on variance and the coefficient of variation. It also analyzes the shortcomings of the TF-IDF algorithm in the feature weighting stage and accordingly proposes the TF-CV algorithm, which effectively improves text classification results. Third, it implements the algorithms. In terms of language, the experiments cover two text classification systems, one Chinese and one English. In terms of classifiers, the KNN and Bayes algorithms, which gave the best classification results, are selected. In terms of data distribution, both balanced and unbalanced data sets are used. Fourth, it presents the experiments and analyzes the results. Eight comparative experiments are carried out and the results of each are analyzed. In every experiment, the results improve significantly on those obtained before the improvement. |
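
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the standard CHI (chi-square) statistic for a term and a class, computed from document counts. It is not the Var-CV-CHI or TF-CV algorithm proposed in the paper, whose formulas are not given here. The function names `chi_square` and `top_k_terms`, the toy corpus, and the choice to score each term by its maximum CHI value over all classes are illustrative assumptions.

```python
def chi_square(docs, labels, term, cls):
    """Standard chi-square score of `term` for class `cls`.

    docs   : list of sets of tokens (one set per document)
    labels : list of class labels, aligned with docs
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == cls
        if has_term and in_class:
            a += 1      # in class, contains term
        elif has_term:
            b += 1      # other class, contains term
        elif in_class:
            c += 1      # in class, lacks term
        else:
            d += 1      # other class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0      # term or class is absent or universal: no signal
    return n * (a * d - b * c) ** 2 / denom


def top_k_terms(docs, labels, k):
    """Rank terms by their maximum chi-square over all classes (an assumption)."""
    vocab = set().union(*docs)
    classes = set(labels)
    scores = {
        t: max(chi_square(docs, labels, t, cls) for cls in classes)
        for t in vocab
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, for illustration only.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
print(top_k_terms(docs, labels, 2))
```

In this toy corpus, "team" appears equally often in both classes and so scores zero, while "ball" and "vote" each occur in only one class and score highest. The statistic uses only document counts and ignores how term frequency varies within a class, which is the kind of information that additional parameters such as variance or the coefficient of variation could supply.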