| Natural language processing (NLP) is an important research direction in computer science and related fields. It studies methods that enable effective communication between computers and humans through natural language. Text classification, one of the most important applications of NLP, is closely tied to daily life, so research on it has considerable practical significance. TF-IDF is a feature-weighting method that is widely used in text classification because of its simplicity and efficiency, but the traditional TF-IDF calculation has several shortcomings that limit its classification performance: (1) It ignores the semantic characteristics of feature words, although their position, length, and part of speech all affect the final classification results; (2) It ignores the distribution of feature words within a class, although whether a feature word is evenly distributed within a class affects the final classification results; (3) It ignores the distribution of feature words among classes, although the dispersion of a feature word across different categories affects the final classification results. To address these three shortcomings, this paper first proposes the TFIDF-MW1 algorithm, which has the following characteristics: (1) Because traditional TF-IDF does not consider the position or part of speech of feature words, a weighting factor is designed to reduce the noise caused by local keywords; (2) A variance formula for the intra-class distribution is introduced, which remedies the failure of traditional TF-IDF to consider how feature words are distributed within their own class; (3) A coefficient-of-variation formula for the inter-class distribution is introduced, which remedies the failure of traditional TF-IDF to consider how feature words are distributed among different classes. To address two shortcomings of TFIDF-MW1, this paper then proposes the TFIDF-MW2 algorithm, which has the following characteristics: (1) TFIDF-MW1 considers only whether feature words are uniformly distributed within a class and ignores the interference of word frequency when the degree of uniformity is similar, so an intra-class distribution correction factor is introduced to supplement it; (2) TFIDF-MW1 considers only how dispersed feature words are among classes and ignores the effect of word frequency when the dispersion is similar, so an inter-class distribution correction factor is introduced to complement it. |
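
The abstract names three quantities without giving their formulas: the traditional TF-IDF weight, a variance over the intra-class distribution, and a coefficient of variation over the inter-class distribution. The following is a minimal Python sketch of what those quantities could look like on a toy corpus. It is not the TFIDF-MW1 or TFIDF-MW2 formula: the corpus, the function names, the use of relative term frequency, and the way the statistics are computed are all assumptions made for illustration, and the sketch does not attempt the position/part-of-speech weighting factor or the correction factors.

```python
import math
from collections import Counter
from statistics import mean, pstdev, pvariance

# Hypothetical toy corpus (an assumption): (class label, tokenized document).
CORPUS = [
    ("sports", ["match", "goal", "team", "goal"]),
    ("sports", ["team", "coach", "match"]),
    ("finance", ["stock", "market", "team"]),
    ("finance", ["stock", "bank", "market", "stock"]),
]


def tf_idf(term, doc, docs):
    """Traditional TF-IDF: relative term frequency times log(N / df)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    # A term that occurs in no document gets weight 0.
    return tf * math.log(len(docs) / df) if df else 0.0


def intra_class_variance(term, label, corpus):
    """Population variance of the term's relative frequency over the
    documents of one class (small value = evenly distributed)."""
    freqs = [Counter(doc)[term] / len(doc) for lab, doc in corpus if lab == label]
    return pvariance(freqs)


def inter_class_cv(term, corpus):
    """Coefficient of variation (std / mean) of the term's per-class
    relative frequency (large value = concentrated in few classes)."""
    totals, lengths = Counter(), Counter()
    for lab, doc in corpus:
        totals[lab] += Counter(doc)[term]
        lengths[lab] += len(doc)
    rates = [totals[lab] / lengths[lab] for lab in lengths]
    mu = mean(rates)
    return pstdev(rates) / mu if mu else 0.0


if __name__ == "__main__":
    docs = [doc for _, doc in CORPUS]
    for term in ("goal", "team", "stock"):
        print(term, tf_idf(term, docs[0], docs), inter_class_cv(term, CORPUS))
```

In this toy corpus, "stock" occurs only in the finance class and "team" occurs in both classes, so "stock" gets the larger inter-class coefficient of variation. That is the kind of class-discriminating signal the abstract says traditional TF-IDF does not capture.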