
Research On Chinese Text Representation And Classification Based On Deep Learning

Posted on: 2024-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: M Zou
Full Text: PDF
GTID: 2568307055977639
Subject: Electronic Information (Electronics and Communication Engineering) (Professional Degree)
Abstract/Summary:
With the widespread application of internet technology, internet platforms generate large amounts of text data every day, and this data contains rich information. Accurately and efficiently classifying large volumes of text and mining the important information they contain is an urgent problem. Text classification technology makes it possible to manage large amounts of text data effectively and is crucial for fields such as sentiment analysis, topic classification, and information retrieval. Depending on length, the text data produced in these fields can be divided into short texts and long texts. The main difficulty in Chinese short text classification lies in fully extracting text feature information, whereas Chinese long texts suffer from large information content and an uneven distribution of feature information. To address these issues, this thesis studies the classification of both Chinese short texts and Chinese long texts. The main research content is as follows:

(1) A Chinese short text classification model based on ALBERT and BiGRU-CNN is proposed to address two problems: a single neural network model cannot fully extract text feature information, and common static word vector models cannot handle polysemy (multiple meanings of a word). The model first uses the ALBERT pre-training model to generate dynamic word vectors, mitigating the impact of polysemy on classification. A bidirectional gated recurrent unit (BiGRU) then extracts the global semantic information of the text, and a convolutional neural network (CNN) applies convolution and pooling to the BiGRU hidden-state representations to extract local semantic information, improving the model's feature extraction capability. The proposed model is evaluated on two publicly available Chinese short text datasets and compared with several baseline models; the results show that it achieves better classification performance.

(2) To improve the ability of word vector models to represent long texts and to make better use of deep neural networks for long text classification, a Chinese long text classification model based on dual-channel feature fusion is proposed. The model performs text representation and feature extraction through two channels, Word2vec-BiGRU-Attention and TextRank-ALBERT. The Word2vec-BiGRU-Attention channel trains word vectors with the Word2vec method and uses a BiGRU-Attention model to obtain a feature vector for the entire text. The TextRank-ALBERT channel first extracts key sentences from the long text with the TextRank algorithm to form a key-sentence text, which is fed into the ALBERT pre-training model to obtain a feature vector for the key-sentence text. The feature vectors of the two channels are then fused to obtain more comprehensive and important text feature information, enhancing the model's text representation ability. The effectiveness of the proposed model is verified through multiple sets of experiments on the publicly available Sohu News dataset and a self-constructed Sina News dataset.

In summary, the Chinese short text classification method based on ALBERT and BiGRU-CNN and the Chinese long text classification method based on dual-channel feature fusion proposed in this thesis perform well and can better solve several problems in the Chinese short text and Chinese long text classification tasks.
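For illustration, the short text model described in (1) can be sketched in PyTorch as follows. This is a minimal sketch of the described pipeline only: the ALBERT checkpoint name, hidden sizes, kernel sizes, and class count are assumptions for demonstration, not the settings reported in the thesis.

```python
# Minimal sketch of the ALBERT + BiGRU-CNN short text classifier described above.
# Checkpoint name and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel  # an example Chinese ALBERT checkpoint is used below

class AlbertBiGRUCNN(nn.Module):
    def __init__(self, pretrained="voidful/albert_chinese_base",
                 gru_hidden=128, num_filters=100,
                 kernel_sizes=(2, 3, 4), num_classes=10):
        super().__init__()
        self.albert = AutoModel.from_pretrained(pretrained)
        emb_dim = self.albert.config.hidden_size
        # BiGRU captures global (sequence-level) semantic information.
        self.bigru = nn.GRU(emb_dim, gru_hidden, batch_first=True,
                            bidirectional=True)
        # CNN over the BiGRU hidden states captures local n-gram features.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * gru_hidden, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, input_ids, attention_mask):
        # Dynamic, context-dependent token vectors from ALBERT.
        tokens = self.albert(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
        hidden, _ = self.bigru(tokens)            # (B, L, 2*gru_hidden)
        hidden = hidden.transpose(1, 2)           # (B, 2*gru_hidden, L)
        pooled = [torch.relu(conv(hidden)).max(dim=2).values
                  for conv in self.convs]         # max-pool each kernel size
        return self.fc(torch.cat(pooled, dim=1))  # class logits
```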
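The dual-channel long text model described in (2) can be outlined in the same spirit, assuming the TextRank key-sentence extraction and the Word2vec training are performed offline and their outputs are passed in as tensors. The additive attention layer and concatenation-based fusion shown here are one plausible reading of the description, not necessarily the thesis's exact design.

```python
# Rough sketch of the dual-channel (Word2vec-BiGRU-Attention + TextRank-ALBERT)
# fusion model. Inputs: full-text token ids for channel 1, and ids/mask of the
# TextRank key-sentence text for channel 2. Names and sizes are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class DualChannelLongText(nn.Module):
    def __init__(self, w2v_weights, gru_hidden=128,
                 pretrained="voidful/albert_chinese_base", num_classes=10):
        super().__init__()
        # Channel 1: Word2vec embeddings -> BiGRU -> additive attention.
        self.embed = nn.Embedding.from_pretrained(w2v_weights, freeze=False)
        self.bigru = nn.GRU(w2v_weights.size(1), gru_hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * gru_hidden, 1)
        # Channel 2: ALBERT over the TextRank key-sentence text.
        self.albert = AutoModel.from_pretrained(pretrained)
        fused_dim = 2 * gru_hidden + self.albert.config.hidden_size
        self.fc = nn.Linear(fused_dim, num_classes)

    def forward(self, full_ids, key_ids, key_mask):
        # Channel 1: attention-weighted sum of BiGRU states over the full text.
        h, _ = self.bigru(self.embed(full_ids))          # (B, L, 2H)
        weights = torch.softmax(self.attn(h), dim=1)     # (B, L, 1)
        chan1 = (weights * h).sum(dim=1)                 # (B, 2H)
        # Channel 2: first-token (CLS-style) vector of the key-sentence text.
        chan2 = self.albert(input_ids=key_ids,
                            attention_mask=key_mask).last_hidden_state[:, 0]
        # Fuse the two channels by concatenation, then classify.
        return self.fc(torch.cat([chan1, chan2], dim=1))
```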
Keywords/Search Tags:deep learning, feature extraction, Chinese text classification, text representation