
Research On Data Augmentation Algorithm For Imbalanced Data Classification

Posted on: 2024-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Sun
Full Text: PDF
GTID: 2568307064986079
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data classification has become increasingly common and urgent in the era of big data. Traditional machine learning classifiers usually assume that the dataset is balanced and aim for high overall accuracy. When majority class samples far outnumber minority class samples, however, the classifier tends to learn the features of the majority class and ignore those of the minority class, which degrades its performance. Correctly identifying minority class samples, so that recognition performance is balanced across classes, has therefore become a research focus in classification. In real-life scenarios, accurately identifying minority class samples has an important impact on decision-making, for example in rare disease diagnosis in the medical field and in fraud detection and customer churn prediction in the financial field. Evaluating a classifier on such data also requires additional metrics, such as precision and recall, to obtain a comprehensive and accurate assessment. Handling imbalanced data classification correctly is therefore both necessary and meaningful.

This thesis proposes a hybrid resampling algorithm for imbalanced classification of structured data, which combines oversampling based on conditional generative adversarial networks with undersampling based on distance screening. The generative adversarial network is migrated from unstructured data to structured data, a conditional generative adversarial network generates new samples of a specified class, and the generated samples that fall within a certain distance threshold are added to the original imbalanced training set. This reduces the imbalance level and addresses the imbalanced data classification problem. Experimental results show that the algorithm outperforms 12 other imbalanced data classification methods on 37 imbalanced data sets, demonstrating the feasibility of migrating the approach to structured data.

The thesis also investigates imbalanced data classification in biology. Biological omics data typically have few samples and many features, and the RCGAN-DF algorithm is prone to overfitting when trained on data sets with many features. The thesis therefore also proposes a data augmentation and feature selection algorithm based on miRNA omics. The algorithm draws on the important role of miRNA in gene regulation and combines the relationships between miRNAs and their target genes in gene expression data to augment the dataset and select meaningful features for classification; reducing the number of features helps reduce overfitting of the trained model. Experimental results show that the algorithm outperforms traditional data augmentation and feature selection algorithms on three miRNA datasets and identifies significant biomarkers, demonstrating its feasibility in bioinformatics. The algorithm can support disease diagnosis and treatment and offers new ideas and methods for future bioinformatics research.
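
The distance-screening step of the hybrid resampling algorithm described above can be illustrated with a small sketch. This is not the thesis implementation: the abstract does not state the distance metric, the reference points, or the threshold value, so the sketch assumes Euclidean distance to the nearest real minority sample and a hypothetical threshold of 0.5. The `generated` list is a hand-written stand-in for the output of a conditional generative adversarial network, and the names `screen_by_distance` and `imbalance_ratio` are invented for this illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def screen_by_distance(candidates: List[Point],
                       minority: List[Point],
                       threshold: float) -> List[Point]:
    """Keep candidates whose Euclidean distance to the nearest
    real minority sample is at most `threshold`.

    Assumption: the abstract only says "within a certain distance
    threshold"; nearest-neighbour Euclidean distance is our choice.
    """
    kept = []
    for candidate in candidates:
        nearest = min(math.dist(candidate, real) for real in minority)
        if nearest <= threshold:
            kept.append(candidate)
    return kept


def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Majority-to-minority sample count ratio (1.0 means balanced)."""
    return n_majority / n_minority


# Toy two-feature structured data set (values are illustrative only).
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (0.3, 0.3), (0.1, 0.4), (0.4, 0.2)]
minority = [(2.0, 2.0), (2.2, 1.9)]

# Stand-in for samples produced by a conditional GAN for the minority class.
generated = [(2.1, 2.0), (1.9, 2.1), (0.2, 0.2), (5.0, 5.0)]

# Hypothetical threshold; a real value would have to be tuned per data set.
accepted = screen_by_distance(generated, minority, threshold=0.5)
augmented_minority = minority + accepted

print("accepted synthetic samples:", accepted)
print("imbalance ratio before:", imbalance_ratio(len(majority), len(minority)))
print("imbalance ratio after:", imbalance_ratio(len(majority), len(augmented_minority)))
```

In this toy run the two candidates close to the real minority samples are kept, while the one lying in the majority region and the distant outlier are discarded, so the imbalance ratio falls from 3.0 to 1.5.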
Keywords/Search Tags: imbalanced data classification, conditional generative adversarial networks, deep learning, microRNA, feature selection