| Text categorization is a key technology for text information processing in natural language processing. It consists mainly of text representation and a classification model (algorithm). In today's era of rapidly growing textual information, text categorization plays a major role in obtaining needed information effectively, conveniently, and quickly. As one of the main carriers of text information, short text is characterized by short length, sparse features, a dynamic and real-time nature, and irregular format. Therefore, traditional machine learning algorithms based on bag-of-words features or the vector space model cannot effectively extract features from short text, which in turn degrades classification performance. In recent years, using the powerful feature extraction ability of deep feature learning models for text categorization has become a research hotspot. Based on the convolutional neural network model and the word vector text representation method, this paper studies key techniques for Chinese short text classification. The main research results are as follows: 1. A word vector model for convolutional neural network text classification is proposed. Text feature extraction (text input representation) is central to text classification, and its quality directly affects the performance of the classification system. The most popular text input representation today, the word vector, captures the relevance and similarity between words but ignores contextual word order features, and in some cases causes semantic loss and distortion of the text. To this end, this paper proposes a word vector model, WordNGVec, that combines N-Gram features with Word2vec, and uses the extracted word vectors (Word-NG vectors) as the input to a two-channel convolutional neural network model (DC-CNN). Several sets of comparative experiments show that the proposed method effectively improves text classification under the three
evaluation metrics of precision, recall, and F1. 2. A text classification model based on a regularized hierarchical Softmax convolutional neural network is proposed. The output layer of the traditional convolutional neural network classification model (CNN) uses the standard Softmax with a flat architecture. In text classification tasks with large amounts of data and many categories, its computational complexity is high and training takes a long time. Hierarchical Softmax (H-Softmax), an improved algorithm based on the Huffman tree, can greatly improve training speed. However, because it adds a large number of node parameters, optimization becomes more difficult, requires more iteration steps, and is prone to overfitting, which in turn affects the model's fitting speed and classification performance. To this end, this paper proposes an improved model, RHS-CNN (Regularization Hierarchical Softmax CNN), which uses regularization to constrain the node parameters of H-Softmax, avoiding overfitting and enhancing the generalization ability of the model. Experimental analysis shows that the proposed method achieves some improvement over Softmax and H-Softmax on the corresponding evaluation metrics. |
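
The idea behind a regularized hierarchical Softmax output layer can be illustrated with a minimal NumPy sketch. The abstract does not state which regularizer RHS-CNN uses or how the Huffman tree is built, so the L2 penalty, the function names (`build_huffman_paths`, `class_log_prob`, `regularized_loss`), the class counts, and the feature vector `h` (standing in for the CNN's pooled output) are all assumptions made for illustration, not the paper's implementation.

```python
import heapq
import numpy as np


def build_huffman_paths(class_counts):
    """Build a Huffman tree over classes; return root-to-leaf paths.

    Each path is a list of (inner_node_index, branch_bit) pairs.
    Leaves use ids 0..k-1; inner nodes get ids k, k+1, ...
    """
    k = len(class_counts)
    # Heap entries: (count, unique tie-breaker, node id).
    heap = [(count, i, i) for i, count in enumerate(class_counts)]
    heapq.heapify(heap)
    parent = {}  # node id -> (parent id, branch bit)
    next_id = k
    while len(heap) > 1:
        c0, _, n0 = heapq.heappop(heap)
        c1, _, n1 = heapq.heappop(heap)
        parent[n0] = (next_id, 0)
        parent[n1] = (next_id, 1)
        heapq.heappush(heap, (c0 + c1, next_id, next_id))
        next_id += 1
    paths = {}
    for leaf in range(k):
        path, node = [], leaf
        while node in parent:
            p, bit = parent[node]
            path.append((p - k, bit))  # re-index inner nodes from 0
            node = p
        paths[leaf] = path[::-1]  # root first
    return paths, next_id - k  # k - 1 inner nodes


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def class_log_prob(h, path, theta):
    """log P(class | h): sum of log binary decisions along the path."""
    logp = 0.0
    for node, bit in path:
        p_one = sigmoid(theta[node] @ h)
        logp += np.log(p_one if bit == 1 else 1.0 - p_one)
    return logp


def regularized_loss(h, label, paths, theta, lam=1e-3):
    """Negative log-likelihood plus an (assumed) L2 penalty on node parameters."""
    nll = -class_log_prob(h, paths[label], theta)
    penalty = 0.5 * lam * np.sum(theta ** 2)
    return nll + penalty


# Toy usage: 5 classes with hypothetical frequencies, 8-dim feature vector.
rng = np.random.default_rng(0)
class_counts = [50, 20, 10, 5, 5]
paths, n_inner = build_huffman_paths(class_counts)
h = rng.normal(size=8)
theta = rng.normal(scale=0.1, size=(n_inner, 8))
probs = [np.exp(class_log_prob(h, paths[c], theta)) for c in range(len(class_counts))]
loss = regularized_loss(h, 3, paths, theta, lam=1e-2)
```

In this sketch, scoring one class touches only the inner nodes on its path (short for frequent classes) rather than every class as in flat Softmax, while the penalty term discourages the node parameters from growing large, which is one common way to limit overfitting.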