Research On Hierarchical Data Augmentation And Learning Method For News Text Classification

Posted on: 2022-03-05
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhang
Full Text: PDF
GTID: 2518306572997149
Subject: Computer technology
Abstract/Summary:
Text classification is one of the most closely watched frontier problems in natural language processing, in both academia and industry. Many general text classification algorithms rely heavily on datasets with sufficient data and balanced classes. When the data is insufficient and the classes are imbalanced, these algorithms become ineffective and lose robustness. In addition, news spans a wide range of text styles and consists of long texts with rich semantics and diverse expressions, which makes text classification in the news domain a major challenge. To address these problems, the hierarchical data augmentation and learning framework HDAL is designed for news text classification tasks. The hierarchical data augmentation model performs two-layer "text-feature" data augmentation. In the text layer, it combines methods based on statistics, graphs, and latent semantics to apply text extraction algorithms to data augmentation. The extraction is improved with news headline information and an information entropy algorithm, and linear programming with redundancy constraints is used to obtain the augmented text. In the feature layer, the model adopts the Mixup method, which generates new samples in the neighborhood of small-sample points by linearly interpolating between the points to which existing samples are mapped in the feature space. The hierarchical learning method keeps the number of samples in different classes relatively balanced at each layer of the classification process by separating the large-sample class from the other classes. The two classification tasks are then combined, and re-weighting is used to set the proportional coefficients of the loss function, which reduces the confusion and interference that majority classes cause for other classes during learning. The HDAL framework is tested on text classification with two news datasets, NSDC and 20 Newsgroups. The experiments show that HDAL improves the F1 of the text classification algorithms by 2% to 5%. Compared with the data augmentation algorithm EDA, HDAL improves the F1 of the text classification algorithms by more than 1% and takes less time.
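The feature-layer Mixup step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `mixup` and `augment_minority`, the Beta(alpha, alpha) sampling of the interpolation weight, and the choice to keep the weight at 0.5 or above so that new points stay closer to the small-class sample are all assumptions made for illustration.

```python
import random


def mixup(x_a, x_b, y_a, y_b, lam):
    """Linearly interpolate two feature vectors and their one-hot labels.

    lam is the weight on the first sample; (1 - lam) goes to the second.
    """
    x_new = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_new = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_new, y_new


def augment_minority(features, labels, minority_idx, n_new, alpha=0.2, seed=0):
    """Generate n_new synthetic samples near small-class points.

    Assumption (not from the thesis): lam ~ Beta(alpha, alpha), folded to
    >= 0.5 so each new point lies nearer the small-class sample.
    """
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        i = rng.choice(minority_idx)          # a small-class sample
        j = rng.randrange(len(features))      # any other existing sample
        lam = rng.betavariate(alpha, alpha)
        lam = max(lam, 1.0 - lam)             # keep weight on sample i >= 0.5
        new_samples.append(
            mixup(features[i], features[j], labels[i], labels[j], lam)
        )
    return new_samples


# Toy feature-space points: three majority-class samples, one small-class sample.
features = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 4.0]]
labels = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = augment_minority(features, labels, minority_idx=[3], n_new=5)
```

Each synthetic label is a soft label whose entries sum to 1, so it can be fed to a cross-entropy loss that accepts probability targets.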
Keywords/Search Tags:Text Classification, Data Augmentation, Hierarchical Learning, Class Imbalance