| With the rapid spread of the Internet and the rise of social networks, the amount of information in the form of short texts, such as text messages, microblogs, news websites, and forums, has grown dramatically. The emergence of short texts has brought new challenges to text research, since the large volume of short text data contains diverse viewpoints and positions on social phenomena, and the topics span politics, economy, military affairs, entertainment, daily life, and other fields. Studying these different types of short texts can provide solutions for research areas such as topic tracking and discovery, Internet information supervision, buzzword analysis, public opinion early warning, and public opinion guidance. After summarizing and analyzing the current state of short text representation and classification, this paper carries out in-depth research on both aspects and obtains the following results: 1. To address the drawback of the traditional high-dimensional sparse representation of short texts, a short text representation learning method based on semantic feature space context, SFCR, is proposed. Because the initial feature space is high-dimensional, terms are semantically clustered based on the mutual information and co-occurrence between terms, and the semantic feature space is then represented by the cluster centers. Next, context information is integrated via the semantic feature space, and on this basis three similarity calculation methods are established to compute the similarity between the terms of the short text to be represented and the feature terms in the feature space. A text mapping matrix is then constructed for short text representation learning. The experimental results show that the proposed method reflects the semantic information of short texts well and represents short texts reasonably and effectively. 2. Unlike the classification methods
of extending short texts or using additional information to avoid the sparseness problem of short texts, we propose a sparse representation classification method for short texts with an entropy weight constraint. Because the original dictionary is too high-dimensional and contains redundant data, we first use the Word2vec tool to represent the words in the dictionary as word vectors and reduce the dimension of the original dictionary based on the weighted average vector. Second, the dictionary is filtered with a fast feature subset selection algorithm to remove irrelevant and redundant short texts. Third, on the filtered dictionary, an objective function for sparse representation with an entropy weight constraint is designed, and the Lagrange multiplier method is introduced to obtain its optimal value, which yields the subspace of each class. Finally, in the learned subspaces, the distance between the short text to be classified and the short texts in each class is calculated, and the short text is classified according to three classification rules. The experiments show that the proposed method significantly improves short text classification efficiency and outperforms the existing short text classification methods. |
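
The role of the Lagrange multiplier in the second contribution can be illustrated with a minimal sketch. The abstract does not state the objective function, so the form used here is an assumption taken from the generic entropy-weighting idea: minimize sum_j w_j * d_j + gamma * sum_j w_j * ln(w_j) subject to sum_j w_j = 1. Setting the derivative of the Lagrangian to zero gives w_j proportional to exp(-d_j / gamma). The names `entropy_weights`, `dispersions`, and `gamma` are hypothetical and do not come from the original work.

```python
import math


def entropy_weights(dispersions, gamma=1.0):
    """Closed-form minimizer of sum(w*d) + gamma*sum(w*ln w) with sum(w) = 1.

    Assumed objective (not taken from the paper). The stationary point of
    the Lagrangian is w_j proportional to exp(-d_j / gamma).
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Shift by the smallest dispersion for numerical stability; the shift
    # scales every term by the same factor, so the weights are unchanged.
    d_min = min(dispersions)
    scores = [math.exp(-(d - d_min) / gamma) for d in dispersions]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical per-dimension dispersions within one class.
weights = entropy_weights([0.2, 1.0, 3.0], gamma=0.5)
print(weights)
```

Under this assumed objective, dimensions with smaller within-class dispersion receive larger weights, and `gamma` controls how sharply the weights concentrate: a large `gamma` pushes them toward a uniform distribution.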