Research On Improved Feature Selection And Classification Algorithm For Chinese Text

Posted on:2022-07-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Wan

Full Text:PDF

GTID:2518306575968569

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the rapid development of internet technology, more and more people obtain information through the internet, and most of these internet resources exist in the form of text. Faced with massive document resources, users must spend a great deal of time digging out the information that is valuable to them, which seriously affects their quality of life. As a key technology for processing big data, text classification can help mine text resources quickly. However, as the amount of textual network data grows, cluttered text contains too many interfering samples and noisy features. These reduce both the efficiency and the accuracy of the classification model and make it hard to locate the needed information quickly. To address the limited classification performance and excessive time overhead of text classification, this thesis studies two stages, feature selection and classification:

1. To handle the large amount of noise and redundant features in text data and obtain a more representative feature set, a feature selection method combining improved chi-square statistics with principal component analysis is introduced. First, because the chi-square algorithm ignores word frequency, document length, category distribution, and negatively correlated features, corresponding adjustment factors are introduced to improve the chi-square calculation model. Then the improved chi-square model is used to evaluate the features, and the top-ranked features are selected as the preliminary feature set. Finally, principal component analysis is used to extract the principal components while largely retaining the original information, achieving dimensionality reduction. Experiments verify that, compared with traditional feature selection algorithms and similar methods, the proposed method improves classification performance across multiple feature dimensions and multiple categories.

2. To address the excessive computation time and limited classification performance of the KNN algorithm as the number of texts increases, a weighted KNN classification algorithm that uses the K-medoids algorithm for sample selection is introduced. In sample selection, the K-medoids algorithm first clusters the texts. The samples are then screened in two cases according to the similarity between the sample to be tested and the center sample of each cluster, and the samples with high similarity are kept, reducing the number of training samples. In the classification decision, each neighboring text is assigned an amount of category information according to its similarity, class imbalance is compensated according to the number of samples per class, and the resulting weighted KNN improves the decision function. The experimental results show that this method selects samples effectively and reduces the time overhead while maintaining classification performance.
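
The first stage (chi-square scoring, top-feature selection, then principal component analysis) can be illustrated with a minimal sketch. It uses the classic chi-square statistic only: the abstract does not give the thesis's adjustment factors for word frequency, document length, category distribution, and negative correlation, so they are not reproduced here. The function names, the binary document-term input, and the rule of taking each term's maximum score over categories are assumptions made for illustration.

```python
import numpy as np

def chi_square_scores(X, y):
    """Classic chi-square score of every term for every category.

    X: (n_docs, n_terms) matrix; any positive entry means "term occurs".
    y: (n_docs,) category labels.
    Returns (classes, scores) with scores of shape (n_classes, n_terms).
    """
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    n = X.shape[0]
    classes = np.unique(y)
    scores = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        in_c = y == c
        A = X[in_c].sum(axis=0)    # docs in c that contain the term
        B = X[~in_c].sum(axis=0)   # docs outside c that contain the term
        C = in_c.sum() - A         # docs in c without the term
        D = (~in_c).sum() - B      # docs outside c without the term
        num = n * (A * D - B * C) ** 2
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        # A term present in all or no documents has denom == 0; score it 0.
        scores[i] = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    return classes, scores

def select_then_pca(X, y, top_k, n_components):
    """Keep the top_k terms by max chi-square score, then project with PCA."""
    _, scores = chi_square_scores(X, y)
    keep = np.argsort(scores.max(axis=0))[::-1][:top_k]
    Xk = np.asarray(X, dtype=float)[:, keep]
    Xc = Xk - Xk.mean(axis=0)                 # center before PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return keep, Xc @ Vt[:n_components].T     # scores on leading components

# Toy corpus: terms 0 and 1 separate the classes, term 2 occurs everywhere.
X_demo = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y_demo = [0, 0, 1, 1]
kept_demo, Z_demo = select_then_pca(X_demo, y_demo, top_k=2, n_components=1)
```

In this toy corpus the term that occurs in every document receives a score of zero and is dropped before PCA, which is the kind of uninformative feature the selection step is meant to remove.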
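
The second stage can be sketched in the same hedged way. The abstract does not specify the two screening cases or the exact weighting formula, so the code below substitutes simple stand-ins: cluster assignments and medoid rows are assumed to be precomputed by K-medoids, candidates are taken from the clusters whose medoid is most similar to the query, and each neighbor votes with its cosine similarity, optionally divided by its class size to counter class imbalance. All function and parameter names are hypothetical.

```python
import numpy as np

def cosine_sim(q, X):
    """Cosine similarity between vector q and every row of X."""
    q = np.asarray(q, dtype=float)
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(q)
    return np.divide(X @ q, norms, out=np.zeros(X.shape[0]), where=norms > 0)

def select_candidates(q, X, cluster_of, medoid_rows, keep_clusters=1):
    """Row indices of training samples in the clusters closest to q.

    cluster_of[i] is the cluster id (0..n_clusters-1) of row i, and
    medoid_rows[j] is the row index of cluster j's medoid; both are
    assumed to come from a prior K-medoids run.
    """
    X = np.asarray(X, dtype=float)
    sims = cosine_sim(q, X[np.asarray(medoid_rows)])
    best = np.argsort(sims)[::-1][:keep_clusters]
    return np.flatnonzero(np.isin(cluster_of, best))

def weighted_knn_predict(q, X, y, k=3, class_sizes=None):
    """Similarity-weighted KNN vote; class_sizes (label -> count) is optional."""
    y = np.asarray(y)
    sims = cosine_sim(q, X)
    nearest = np.argsort(sims)[::-1][:k]
    votes = {}
    for i in nearest:
        label = y[i].item()
        weight = sims[i]
        if class_sizes is not None:
            weight = weight / class_sizes[label]   # damp large classes
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)

# Toy data: two clusters that coincide with the two classes.
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
query = [1.0, 0.05]
cand = select_candidates(query, X_train, cluster_of=[0, 0, 1, 1], medoid_rows=[0, 2])
pred = weighted_knn_predict(query, X_train[cand], y_train[cand], k=2,
                            class_sizes={0: 2, 1: 2})
```

Restricting the neighbor search to the candidate rows is what reduces the time overhead: the similarity computation runs over the selected clusters instead of the whole training set.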

Keywords/Search Tags:

text classification, feature selection, chi-square statistics, K-medoids, KNN

Related items

1	Extraction Of Chi-square Features In Chinese Text Classification And Improvement Of TF-IDF Weight
2	Research On Local Feature Selection Of Chinese Text
3	Research On Sentiment Text Classification For Product Reviews
4	Study On The Text Classification Feature Selection Method-the Uyghur Language
5	Classification Research On News Text Classification Based On Feature Selection Method
6	Research And Improvement Of Feature Selection Algorithm In Chinese Text Classification
7	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
8	Research On Chi-square Statistic Feature Selection Method And TF-IDF Feature Weighting Method For Chinese Text Classification
9	The Research On Text Categorization Technology Based On Partial Least Square
10	Research And Implementation Of Chinese Text Classification, Feature Selection Method,