| With the rapid development of internet technology, more and more people obtain information online, and most of these resources exist as text. Faced with massive document collections, users must spend too much time digging out the information that is valuable to them, which seriously affects their quality of life. As a key technology for processing big data, text classification can help us mine text resources quickly. However, as the volume of textual network data grows, cluttered text contains too many interfering samples and noisy features, which reduces both the efficiency and the accuracy of classification models and makes it hard to locate information quickly. To address limited classification performance and excessive classification time overhead, this thesis studies text classification in two stages, feature selection and classification: 1. Because text data contains a large amount of noise and many redundant features, a feature selection method combining improved chi-square statistics with principal component analysis is introduced to obtain a more representative feature set. First, because the traditional chi-square algorithm ignores word frequency, document length, category distribution, and negatively correlated features, corresponding adjustment factors are introduced to improve the chi-square calculation model. Next, the improved model is used to score the features, and the top-ranked features are selected as the preliminary feature set. Finally, principal component analysis extracts the principal components while largely retaining the original information, achieving dimensionality reduction. Experiments verify that, compared with traditional feature selection algorithms and similar methods, the method proposed in this thesis achieves an improvement
in classification performance across multiple feature dimensions and multiple categories. 2. As the number of texts grows, the KNN algorithm suffers from excessive computation time and limited classification performance. To address this, a weighted KNN classification algorithm that uses the K-medoids algorithm for sample selection is introduced. In sample selection, the K-medoids algorithm first clusters the texts; the samples are then screened in two cases according to the similarity between the sample to be classified and the center sample of each cluster, and samples with high similarity are kept, which reduces the number of training samples. In the classification decision, each neighboring text is weighted by its similarity to reflect the amount of category information it carries, class imbalance is corrected according to the number of samples, and the resulting weighted KNN improves the decision function. The experimental results show that this method selects samples effectively and reduces time overhead while maintaining classification performance. |
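
As an illustration of the first stage, the following is a minimal sketch of chi-square feature scoring followed by top-k selection. It uses the classic chi-square statistic over a term/category contingency table, not the improved model: the abstract does not give the adjustment factors for word frequency, document length, or category distribution, so they are omitted, as is the PCA step. Keeping only positively associated term/category pairs is an assumption, shown as one simple way to treat negative correlation. The names `chi_square` and `select_top_features` are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Classic chi-square for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_top_features(docs, labels, k):
    """Rank terms by their best chi-square score over all categories.

    docs is a list of token lists, labels the matching category names.
    Assumption: only positively associated pairs (a*d > b*c) are scored.
    """
    n_docs = len(docs)
    cat_count = Counter(labels)
    df = Counter()      # document frequency of each term
    df_cat = Counter()  # document frequency of each (term, category) pair
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[(term, label)] += 1

    scores = {}
    for term in df:
        best = 0.0
        for cat, n_cat in cat_count.items():
            a = df_cat[(term, cat)]
            b = df[term] - a
            c = n_cat - a
            d = n_docs - n_cat - b
            if a * d > b * c:
                best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

A term that appears in every document, such as a stop word, scores zero here because its contingency table has an empty column, so it drops out of the selected set.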
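
The second stage can be sketched in the same spirit. The code below assumes the clusters have already been produced (for example by K-medoids) and are passed in as lists of `(vector, label)` pairs. The abstract does not spell out the two screening cases, so `select_samples` uses a single hypothetical rule: keep every cluster whose medoid reaches a similarity threshold with the query. Dividing each vote by the class size is likewise an assumption, shown as one plausible way to correct class imbalance. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def medoid(cluster):
    """Return the member with the highest total similarity to its cluster."""
    return max(cluster, key=lambda p: sum(cosine(p[0], q[0]) for q in cluster))


def select_samples(query, clusters, threshold):
    """Keep clusters whose medoid is similar enough to the query.

    Hypothetical rule: the thesis screens samples in two cases that the
    abstract does not describe.
    """
    kept = []
    for cluster in clusters:
        if cosine(query, medoid(cluster)[0]) >= threshold:
            kept.extend(cluster)
    return kept


def weighted_knn(query, samples, k):
    """Similarity-weighted KNN vote, normalised by class size (assumption)."""
    class_size = Counter(label for _, label in samples)
    neighbours = sorted(samples, key=lambda s: cosine(query, s[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in neighbours:
        votes[label] += cosine(query, vec) / class_size[label]
    return max(votes, key=votes.get)
```

Filtering before the vote is what reduces the time overhead: the neighbor search only runs over the retained clusters instead of the full training set.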