
Research On Data Feature Selecting And Data Balancing Methods Based On Genetic Algorithm

Posted on: 2022-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Wang
GTID: 2518306329488474
Subject: Signal and Information Processing
Abstract/Summary:
With the advent of big data, data floods in from all directions. To obtain useful information, people need to analyze that data. However, real-world data is often redundant, and if it is used directly, model performance will be poor. Redundant data should therefore be cleaned before the data set is passed to a classifier. Data has many features, but not every feature has a positive effect: some redundant features not only increase the amount of computation but may also reduce classification accuracy. The processing of data features is mainly divided into feature selection and feature extraction, and this paper mainly studies feature selection algorithms. While studying feature selection, we found that balancing an imbalanced data set can improve model performance, so this paper also studies methods for balancing data sets.

To solve the problem of selecting feature subsets, this paper adopts a multi-dimensional data feature selection method that combines the genetic algorithm with the dragonfly algorithm, named the genetic dragonfly algorithm (GDA). The traditional genetic algorithm has been applied to feature selection to obtain better feature subsets, but it suffers from low accuracy and slow optimization. To speed up convergence, this paper embeds the dragonfly algorithm into the crossover and mutation steps of the genetic algorithm. The dragonfly algorithm is used to find the optimal gene position and the worst gene position, so that the optimal gene is retained and the worst gene is discarded during crossover. In the traditional genetic algorithm, all genes have the same mutation probability. The genetic dragonfly algorithm instead sets different mutation probabilities according to the optimal and worst positions, so that the optimal gene is more likely to be selected and the worst gene is more likely to be discarded. The method is tested on five different data sets with five different classifiers, and the results show that the proposed feature selection method is more effective and robust.

This paper also proposes a SMOTE algorithm based on boundary enhancement and internal clustering (BEIC-SMOTE) to address the poor classification performance on imbalanced data sets. The traditional SMOTE algorithm generates new samples randomly. Improved SMOTE algorithms enhance the boundary by generating new samples for the minority samples at the boundary, but generating new samples only at the boundary is likely to neglect the internal samples. The method in this paper considers both the boundary and the interior when generating new samples, which ensures that the boundary is clearly delineated and that the features of the internal minority samples are enhanced. Experiments show that the BEIC-SMOTE method of balancing data is more effective than other methods.
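
For readers unfamiliar with the baseline that BEIC-SMOTE builds on, the interpolation step of classic SMOTE can be sketched as below. This is only a minimal illustration of the traditional algorithm, not the thesis's BEIC-SMOTE: it contains no boundary enhancement or internal clustering. The function name `smote_oversample` and the parameters `n_new`, `k`, and `seed` are assumptions introduced for this example and do not come from the thesis.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by classic SMOTE interpolation.

    minority: list of equal-length feature vectors (lists of floats).
    NOTE: illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest minority neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # The new sample lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_oversample(minority_points, n_new=3, k=2, seed=1):
        print(point)
```

Because every base sample is equally likely to be chosen, this baseline treats boundary and interior minority samples alike, which is the behaviour the boundary-aware variants described above set out to change.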
Keywords/Search Tags: machine learning, feature selection, genetic algorithm, dragonfly algorithm, SMOTE