
Research On Text Representation And Classification Algorithm Based On Model Integration

Posted on: 2023-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Du
GTID: 2557306833487134
Subject: Applied Statistics
Abstract/Summary:
With the rapid development of education in China, online reports of educational information have become both numerous and long. This poses a challenge to text classification technology, and such reports can hardly be classified and sorted by traditional means. To enable readers concerned with education to browse the education news of a specific stage according to their needs, this paper annotates the text of educational news reports with labels. Then, based on an analysis of existing text classification algorithms, it designs a text classification method suited to education news. The discussion covers the following aspects.

(1) To address the problems of information loss and high dimensionality in long text classification, this paper proposes the LDA-D2V text representation method. First, the topic distribution obtained through LDA training is mapped into the Doc2vec model to obtain a new topic vector. Then, the Doc2vec model is trained on the documents to obtain the document vectors. Finally, the cosine similarity between the new topic vector and the document vector is computed and used as the text representation. While retaining the advantages of the LDA model, the algorithm adds semantic information from the text, so that the resulting vector represents the text more completely.

(2) This paper studies a text classification algorithm that combines a CNN-BiLSTM network with an attention model, in order to address two problems: a convolutional neural network (CNN) cannot use the contextual information of the text, and a recurrent neural network (RNN) cannot handle long-term dependencies. The classification model combines the advantages of the two models. First, the CNN extracts local features of the text. Then, the BiLSTM extracts contextual information, and thus the global features of the text. Finally, an attention layer is added at the end of the model to extract the effective features. The fusion model not only accounts for the fact that different words in the text affect the classification result differently, but also improves classification efficiency.

(3) The LDA-D2V text representation method and the CNN-BiLSTM-ATT classification model are evaluated on an online education news text collection. The experimental results show that both models perform better on the education news text set than the traditional models in common use.
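
The last step of the LDA-D2V method in (1), comparing each topic vector with the document vector by cosine similarity, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how topic distributions are mapped into the Doc2vec space, so the topic vectors and the document vector here are hand-written toy values standing in for trained model outputs, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def lda_d2v_representation(topic_vectors, doc_vector):
    """Represent a document by its similarity to each topic vector.

    The output has one entry per topic, so its dimension equals the
    number of topics rather than the vocabulary size.
    """
    return [cosine_similarity(t, doc_vector) for t in topic_vectors]


# Toy stand-ins (assumptions): in practice these would come from a
# trained LDA model mapped into Doc2vec space and a trained Doc2vec model.
topic_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
doc_vector = [2.0, 0.0, 0.0]

features = lda_d2v_representation(topic_vectors, doc_vector)
print(features)
```

With three topics the document is reduced to a three-dimensional feature vector, which illustrates how a representation of this kind keeps the dimension low for long texts.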
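
The attention layer in (2) weights the per-word outputs of the BiLSTM so that more informative words contribute more to the final feature. The abstract does not specify the scoring function, so the sketch below assumes a simple dot product with a context vector followed by a softmax. The hidden states and the context vector are toy values, whereas in a trained model the context vector would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Weighted sum of hidden states with dot-product attention scores.

    hidden_states: one vector per time step (stand-in for BiLSTM outputs).
    context: vector of the same dimension (assumed scoring scheme).
    Returns the attention weights and the pooled feature vector.
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


# Toy sequence of three time steps with two-dimensional hidden states.
hidden_states = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
context = [1.0, 0.0]

weights, pooled = attention_pool(hidden_states, context)
print(weights, pooled)
```

The second time step aligns with the context vector, so it receives the largest weight and dominates the pooled feature that a classifier layer would consume.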
Keywords/Search Tags: Education news category, LDA model, Doc2vec, CNN, BiLSTM