
Research On Class-imbalanced Text Classification Algorithm Based On Improved BiGRU

Posted on: 2020-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Lin
Full Text: PDF
GTID: 2428330596995460
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, the internet generates a huge amount of data every day. Data classification is an important means of information management, and text classification is an important means of data classification. In practice, however, because of data collection methods and other factors, the text data to be classified are often imbalanced: the number of samples varies greatly across categories. During training, if the features of a class with few samples (minority class) are not distinctive, its samples tend to be predicted as a class with many samples (majority class) at prediction time, which greatly degrades the classification result. General text classification algorithms rarely take the class imbalance of the dataset into account, which makes it difficult for them to learn the features of minority classes.

BiGRU is a kind of deep neural network. In the text classification task, text represented by low-dimensional word vectors is fed into the network, and the features of the text are extracted in both the forward and backward directions, so BiGRU has a strong ability to extract text features. However, BiGRU is not designed specifically for class-imbalanced problems and cannot be applied well to class-imbalanced text classification tasks. Combining feature selection, undersampling and model ensembling, three common approaches to class-imbalanced classification, this thesis improves the BiGRU model and proposes a multichannel enhanced-word-vector BiGRU-Attention model to solve class-imbalanced text classification problems.

(1) Feature selection: the feature words of each category are extracted by the CHI (chi-square) test; each word in the text is then mapped to a category vector representation and combined with the word vector trained by Word2vec, yielding an enhanced word vector with category information that serves as the training input of the model (see the first sketch after this abstract).

(2) Attention: an attention mechanism is introduced into BiGRU to obtain the BiGRU-Attention model, so that when extracting text features the model can better assign weights to the various parts of the text and give higher weights to the important parts.

(3) Undersampling: the samples of the majority classes are undersampled to alleviate the problem that the features of the minority classes are overwhelmed by the features of the majority classes.

(4) Model ensemble: to avoid losing too many majority-class features through undersampling, which would harm the classification result, a multichannel model is applied. Several different groups of samples are first generated by random undersampling and then input into the enhanced-word-vector BiGRU-Attention model of each channel, so the model can learn more features. The features from the multiple channels are fused into the final features for classification, and after the computation of the fully connected layer and the softmax layer the classification result is obtained (see the second sketch after this abstract).

Experiments on a class-imbalanced text classification dataset show that, compared with other algorithms, the proposed algorithm obtains the best mean macro recall, mean macro F1-score and mean G-mean, achieving better classification results.
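The CHI-based enhanced word vectors of step (1) can be sketched as follows. This is a minimal illustration, assuming a binary term-document matrix X, integer class labels y, a vocabulary list and pretrained Word2vec vectors supplied as a dict; the helper names (chi_square_scores, build_enhanced_vectors) and the exact form of the category vector are assumptions for illustration, not the thesis's implementation.

# Sketch: CHI (chi-square) feature selection and category-enhanced word vectors.
# Assumed inputs: binary term-document matrix X (n_docs x n_words), labels y,
# vocabulary list `vocab`, dict `w2v` of pretrained Word2vec vectors.
import numpy as np

def chi_square_scores(X, y, n_classes):
    """Return a (n_words, n_classes) matrix of chi-square scores."""
    N = X.shape[0]
    scores = np.zeros((X.shape[1], n_classes))
    for c in range(n_classes):
        in_c = (y == c)
        A = X[in_c].sum(axis=0)            # docs in class c containing the word
        B = X[~in_c].sum(axis=0)           # docs outside c containing the word
        C = in_c.sum() - A                 # docs in c without the word
        D = (~in_c).sum() - B              # docs outside c without the word
        num = N * (A * D - B * C) ** 2
        den = (A + B) * (C + D) * (A + C) * (B + D) + 1e-12
        scores[:, c] = num / den
    return scores

def build_enhanced_vectors(vocab, scores, w2v, top_k=500):
    """Concatenate each word's Word2vec vector with a category vector whose c-th
    entry is the word's CHI score if the word is a top-k feature word of class c,
    else 0 (one possible reading of the 'category vector')."""
    n_classes = scores.shape[1]
    top_per_class = [set(np.argsort(scores[:, c])[-top_k:]) for c in range(n_classes)]
    enhanced = {}
    for i, word in enumerate(vocab):
        cat_vec = np.array([scores[i, c] if i in top_per_class[c] else 0.0
                            for c in range(n_classes)])
        enhanced[word] = np.concatenate([w2v[word], cat_vec])
    return enhanced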
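The multichannel enhanced-word-vector BiGRU-Attention model of steps (2)-(4) might look roughly like the PyTorch sketch below. The layer sizes, the number of channels, the additive attention with a single scoring layer, and the simple per-class undersampling helper are illustrative assumptions; the thesis's exact architecture and training procedure may differ (in particular, each channel is trained on its own randomly undersampled group, which this sketch does not show).

# Sketch: multichannel BiGRU-Attention with feature fusion and a helper for
# random undersampling. Dimensions and attention form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRUAttention(nn.Module):
    """One channel: BiGRU over enhanced word vectors + attention pooling."""
    def __init__(self, emb_dim, hidden):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)          # scores each time step

    def forward(self, x):                            # x: (batch, seq_len, emb_dim)
        h, _ = self.bigru(x)                         # (batch, seq_len, 2*hidden)
        weights = F.softmax(self.att(h), dim=1)      # attention over time steps
        return (weights * h).sum(dim=1)              # weighted sentence vector

class MultiChannelModel(nn.Module):
    """Several BiGRU-Attention channels; their features are fused and classified."""
    def __init__(self, emb_dim, hidden, n_channels, n_classes):
        super().__init__()
        self.channels = nn.ModuleList(
            [BiGRUAttention(emb_dim, hidden) for _ in range(n_channels)])
        self.fc = nn.Linear(n_channels * 2 * hidden, n_classes)

    def forward(self, x):                            # same text fed to every channel
        feats = [ch(x) for ch in self.channels]
        fused = torch.cat(feats, dim=-1)             # feature fusion across channels
        return F.log_softmax(self.fc(fused), dim=-1) # fully connected + softmax

def random_undersample(X, y, seed=0):
    """Keep, for every class, as many samples as the smallest class has
    (one simple way to build an undersampled training group per channel)."""
    g = torch.Generator().manual_seed(seed)
    counts = torch.bincount(y)
    n_min = counts[counts > 0].min().item()
    keep = []
    for c in torch.unique(y):
        idx = (y == c).nonzero(as_tuple=True)[0]
        keep.append(idx[torch.randperm(len(idx), generator=g)][:n_min])
    keep = torch.cat(keep)
    return X[keep], y[keep]

# Example usage (shapes only): 8 texts, 30 tokens, 300-dim enhanced word vectors.
# model = MultiChannelModel(emb_dim=300, hidden=128, n_channels=3, n_classes=5)
# log_probs = model(torch.randn(8, 30, 300))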
Keywords/Search Tags: Class-imbalanced, Text classification, BiGRU, Attention mechanism