With the rapid development of computer technology, the Internet, and mobile devices, the number of Internet users has surged, and people's role has shifted from consumers of information to both producers and consumers of it. Every day, massive amounts of text data are generated on platforms such as social media and news websites. Automatically extracting valuable information from this text is a hot research topic, with significant research and practical value for scientific institutions, government departments, Internet providers, and other organizations. Since the concept of deep learning was proposed, deep learning techniques have achieved remarkable results in computer vision, speech recognition, natural language processing, and other fields, and numerous studies have shown that they outperform traditional machine learning methods in many respects. Combining the strengths of different deep learning methods to extract text features more fully is currently one of the research hotspots in text classification. This paper combines the complementary feature-extraction characteristics of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) with the structure of text data, designing a hybrid model intended to extract richer features and improve classification performance. The main contributions are summarized as follows:

(1) The basic concepts, pipeline, and related techniques of text classification are introduced, focusing on convolutional and recurrent neural networks. The THUCNews dataset is first converted to a unified format, and the text is then segmented using the precise mode of the Jieba word-segmentation tool. For text representation, to address the sparsity of traditional representations and their lack of contextual semantic information, the Word2Vec
tool is used to train word vectors. Because the Skip-gram model in Word2Vec is trained more fully than the CBOW model, Skip-gram was selected to train word vectors of three dimensions: 100, 150, and 200. Comparative experiments across these dimensions show that 150-dimensional vectors outperform 100-dimensional ones; however, owing to the small dataset and hardware constraints, 200-dimensional vectors perform worse in the model than 150-dimensional ones. Finally, the convolutional and recurrent neural networks are trained, and hyperparameters such as the convolution kernel size, RNN hidden-layer dimension, batch size, and dropout rate are tuned. The results show that the BiLSTM model classifies slightly better than CNN and LSTM.

(2) A study of the structure of the text data found that titles provide very important information. To let the neural network exploit title information more fully, a hybrid CNN-BiLSTM model is designed: BiLSTM extracts features from the title, CNN extracts features from the body content, and the two feature vectors are spliced together for softmax classification. During tuning, comparative experiments determined the optimal title length to be 12. The final results of the hybrid model are then compared with those of the single models and a baseline model; accuracy, recall, and F1 score all improve over CNN, LSTM, BiLSTM, and the baseline. These experimental results demonstrate that separating the title from the body and extracting their features with BiLSTM and CNN respectively yields richer features and improves text-classification performance. In addition, the comparative experiments on different title lengths, set against the results of the CNN
model, further confirm the conclusion that title information affects the classification results.
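The hybrid architecture described above can be sketched as follows. This is a minimal illustration, not the author's exact code: a BiLSTM encodes the title, a TextCNN-style branch (parallel 1-D convolutions with global max pooling) encodes the body, and the two feature vectors are spliced before a softmax output layer. The 150-dimensional embeddings, title length of 12, and softmax classifier follow the text; the vocabulary size, hidden widths, kernel sizes, and body length of 300 tokens are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridTextClassifier(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=150,
                 lstm_hidden=128, num_filters=128,
                 kernel_sizes=(3, 4, 5), num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # BiLSTM branch for the (short) title sequence.
        self.title_lstm = nn.LSTM(embed_dim, lstm_hidden,
                                  batch_first=True, bidirectional=True)
        # CNN branch for the body: parallel convolutions with several
        # kernel sizes, each max-pooled over time (TextCNN style).
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        feat_dim = 2 * lstm_hidden + num_filters * len(kernel_sizes)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, title_ids, body_ids):
        # Title features: concatenate the final forward/backward states.
        _, (h_n, _) = self.title_lstm(self.embed(title_ids))
        title_feat = torch.cat([h_n[0], h_n[1]], dim=1)
        # Body features: conv -> ReLU -> global max pool per kernel size.
        body = self.embed(body_ids).transpose(1, 2)   # (batch, embed, time)
        pooled = [F.relu(conv(body)).max(dim=2).values for conv in self.convs]
        # Splice the title and body features, then classify with softmax.
        feats = torch.cat([title_feat] + pooled, dim=1)
        return F.log_softmax(self.fc(feats), dim=1)

model = HybridTextClassifier()
titles = torch.randint(0, 50000, (4, 12))    # batch of 4 titles, length 12
bodies = torch.randint(0, 50000, (4, 300))   # bodies padded/truncated to 300
log_probs = model(titles, bodies)
print(log_probs.shape)                       # one score vector per class
```

Keeping the two branches separate until the final splice is what lets each encoder specialize: the BiLSTM captures the short, order-sensitive title, while the pooled convolutions pick out salient n-gram features from the longer body.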