| Short text data has exploded in the mobile Internet era. However, because short texts have sparse features and are highly ambiguous, current short text classification algorithms suffer from problems such as low precision and over-fitting. In particular, existing short text modeling methods struggle to express and exploit complex external semantic information, and therefore cannot extract deep semantic information. To address this, from the perspectives of semantic representation and semantic expansion, this paper introduces a manifold re-embedding algorithm for word vectors, proposes a topic model based on manifold learning, and on this basis designs a short text model with semantic expansion to improve precision and reduce over-fitting. Finally, a short text classification system is designed and implemented. The specific work and results of this paper are as follows: (1) Three generations of text modeling techniques are studied, and a topic modeling algorithm based on manifold learning, the M-LDA model, is proposed. To address the limited semantic expressiveness of word vectors in Euclidean space, a re-embedding method based on manifold learning is proposed, and the re-embedded vectors are used as prior knowledge in the initialization of the LDA model. The model is then optimized, yielding a latent Dirichlet allocation model with a manifold learning prior. The experimental results show that the performance of M-LDA is improved by 6.5% and 8.7% on average compared with topic models such as LDA and DMM. (2) A short text modeling method based on semantic expansion, the Set-CNN model, is proposed. The main idea is to expand the semantics of the keywords in a short text with a fast clustering algorithm, and then to process the expanded text with different convolution kernels, including dilated (atrous) convolution, together with a residual threshold mechanism, so that semantic expansion introduces as little noise as possible. A text convolutional neural network is then used for short text modeling. The experimental results show that the algorithm achieves the best performance among the six benchmark models, which confirms the rationality and effectiveness of the model. (3) A short text classification system is designed and implemented, and Sogou news headlines are crawled with its data acquisition module to serve as the data set for system testing. The system is built on the two algorithms, M-LDA and Set-CNN, and includes modules for text acquisition, text preprocessing, word vector pre-training, semantic expansion, and short text classification. The system test results show that the accuracy of the proposed short text classification algorithm based on semantic expansion is about 5% higher than that of LSTM. This paper also compares the algorithms side by side, and the results show that the classification accuracy on the expanded text is 22.6% higher than on the text before expansion. This further shows that the system designed in this paper has high practical value. |
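
As a rough illustration of the dilated ("hole") convolution that the Set-CNN description refers to, the sketch below applies a one-dimensional kernel to a sequence with gaps between the kernel taps. It is a generic, minimal example and not the paper's implementation: the function names `dilated_conv1d` and `receptive_field`, the toy input, and the all-ones kernel are assumptions made for illustration only.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(seq, kernel, dilation=1):
    """Valid (unpadded) 1-D dilated convolution, computed as cross-correlation.

    seq      -- list of numbers (hypothetical stand-in for one feature channel)
    kernel   -- list of weights
    dilation -- gap between kernel taps; 1 gives an ordinary convolution
    """
    span = receptive_field(len(kernel), dilation)
    out = []
    # Slide the kernel over every position where the whole span fits.
    for start in range(len(seq) - span + 1):
        out.append(sum(w * seq[start + k * dilation]
                       for k, w in enumerate(kernel)))
    return out


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # illustrative values only
    kernel = [1.0, 1.0, 1.0]
    print(dilated_conv1d(tokens, kernel, dilation=1))  # adjacent positions
    print(dilated_conv1d(tokens, kernel, dilation=2))  # every second position
```

With the same three-weight kernel, a dilation of 2 covers five input positions instead of three, which is how a dilated kernel can take in a wider context of an expanded text without adding parameters.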