Research On Text Classification Based On Optimized Feature Selection Algorithm

Posted on:2021-10-04

Degree:Master

Type:Thesis

Country:China

Candidate:X D Jiang

GTID:2558306917482074

Subject:Applied Statistics

Abstract/Summary:

In an era of information explosion, information takes many forms, and text appears more and more often in daily life. Faced with massive text data, classifying by manual annotation alone is tedious and time-consuming, so tools are needed to sort the data and find the genuinely valuable information quickly and efficiently. Text classification addresses this practical problem. This thesis studies how to classify text with machine learning algorithms: working from large-scale text data, texts are assigned to categories using content information obtained through feature extraction and mining. The focus is on optimizing the feature selection method in text classification. The main work is as follows:

1. The traditional TF-IDF feature selection method ignores the correlation between feature words and categories, and does not consider how feature words are distributed between categories or within a category. In this thesis, the chi-square value is added to the TF-IDF formula to measure the correlation between feature words and text categories, and parameters for inter-category concentration and intra-category dispersion are added to compensate for traditional TF-IDF's neglect of how feature words are distributed across categories. The result is an improved feature selection method named C-TFIDF.

2. With traditional feature selection methods, the feature dimension has to be set manually, which may leave the selected features stuck at a local optimum so that they do not accurately reflect the text. To avoid this and to reduce the feature dimension, this thesis uses a genetic algorithm to further optimize the feature subset. The feature set obtained by the C-TFIDF method serves as the starting point of the genetic algorithm's search, and text classification accuracy serves as the fitness function, in order to find features that better represent each text category. To some extent, this improves the quality of the features and thereby the accuracy of text classification.

In the experiments, this thesis collects and organizes a set of commonly used stop words and then adds some dynamic stop words to complete the text preprocessing. Classifiers are trained and tested with naive Bayes, SVM, and neural network algorithms, and the classification results are evaluated and analyzed in terms of precision, recall, and F-measure. The experiments show that both the improved C-TFIDF feature selection method and the genetic-algorithm-based feature selection method yield more accurate classification results than the traditional method.
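
The abstract does not give the C-TFIDF formula, so the following is only a minimal Python sketch of the general idea behind the first contribution above: scaling each TF-IDF weight by a chi-square score that measures how strongly a term is associated with a category. Combining the two by multiplication, taking the maximum chi-square over categories, and the function names `chi_square` and `chi_weighted_tfidf` are all assumptions made for illustration; the inter-category concentration and intra-category dispersion parameters mentioned in the abstract are not modeled here.

```python
import math
from collections import Counter


def chi_square(term, label, docs, labels):
    """Chi-square statistic of the 2x2 table (term present?) x (doc in `label`?)."""
    n = len(docs)
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        present = term in doc
        in_class = (y == label)
        if present and in_class:
            a += 1          # term present, document in the category
        elif present:
            b += 1          # term present, document in another category
        elif in_class:
            c += 1          # term absent, document in the category
        else:
            d += 1          # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def chi_weighted_tfidf(docs, labels):
    """Per-document weights tf * idf * chi, with chi taken as the max over categories.

    ASSUMPTION: the product form is illustrative only; it is not the
    thesis's actual C-TFIDF formula, which the abstract does not state.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    classes = set(labels)
    chi = {t: max(chi_square(t, c, docs, labels) for c in classes) for t in df}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (count / len(doc)) * math.log(n / df[t]) * chi[t]
            for t, count in tf.items()
        })
    return weights


# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]
weights = chi_weighted_tfidf(docs, labels)
```

In this toy corpus, "goal" occurs only in the sport documents while "team" occurs in both categories, so the chi-square factor gives "goal" a larger weight than "team" in the first document. Plain TF-IDF would separate them only through document frequency.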

Keywords/Search Tags:

Text classification, Feature selection, TF-IDF, Genetic algorithm, Machine learning

Related items

1	Research On Text Representation And Classification Based On Machine Learning Algorithm
2	A Study Of Text Classification Algorithms Based On Feature Selection
3	Research Of Automatic Text Classification Method Based On Machine Learning
4	Research Of Feature Extraction Technology In KNN Text Classification Based On The Genetic Algorithm
5	Research Of Feature Selection And Weighting Algorithm In Text Classification System Based On SVM
6	Genetic Algorithm Based Model Parameter Selection And Its Application In Text Classification
7	Research On Text Classification Algorithms Based On Machine Learning
8	Text Classification Feature Down-dimensional Method Of Research
9	Research On Text Classification Based-on Support Vector Machine
10	Research On Text Classification Methods Based On Extreme Learning Machine