Research On Ensemble Feature Selection Algorithm For Biomedical Data

Posted on: 2022-12-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Y Han
GTID: 1480306758479234
Subject: Computer system architecture
Abstract/Summary:
The digitization of medicine has produced large volumes of biomedical data, and modeling these data helps researchers diagnose and treat diseases and understand their pathogenesis. Biomedical data fall into two main types: biological omics data and medical data. Such data are complex and contain many redundant features and considerable noise; many features can be substituted by others, and the features are highly correlated. The key to biomedical data mining is to design models that are both high-performing and interpretable.

Machine learning is an important approach to modeling and analysis in biomedicine, and feature selection is one of its important techniques. The goal of feature selection is to find, within the feature space of the data, the feature subset that yields the best modeling performance. Doing so greatly reduces the feature dimension and the computational cost, which benefits subsequent modeling. The selected subset not only gives the model higher performance but also exposes the internal correlations of the data more directly. Traditional feature selection algorithms struggle to achieve good model performance on biomedical data. Ensemble feature selection algorithms can combine multiple feature selection methods according to the characteristics of biomedical data, producing more accurate, stable, and reliable classification results. Aiming at the key problems in feature selection for medical data, and taking transcriptomic data, methylomic data, and medical data as its objects, this dissertation proposes three ensemble feature selection algorithms. The research is summarized as follows:

1. To address the high-dimensional, small-sample nature of biological omics data and the high redundancy and correlation among its features, a dynamic recursive feature elimination (dRFE) framework is proposed. The dRFE algorithm first screens the feature set with a t-test and then trains a supervised model. In each iteration it removes varying numbers of the features with the smallest model coefficients to form candidate subsets, computes the classification performance of the model on each candidate, and keeps the feature set that yields the best performance in that iteration. Comparative experiments on 18 transcriptomic datasets and 5 methylomic datasets show that dRFE outperforms 11 existing feature selection algorithms in most cases.

2. To address the large number of irrelevant and redundant features in high-dimensional biological omics data, an ensemble swarm intelligence feature selection algorithm (Zoo) is proposed. The Zoo algorithm first selects 1000 features by t-test, then integrates nine swarm intelligence feature selection algorithms that vote on the selected features, and finally applies the dRFE framework to refine the feature subset further. The main idea is that different swarm intelligence algorithms applied to feature selection provide complementary search capabilities in the feature space, so their cooperation produces a feature subset that is more accurate and stable than that of any single swarm intelligence feature selection algorithm. The experimental results show that Zoo performs better than nine existing feature selection algorithms on the transcriptome and methylation datasets.

3. To address the complex and redundant relationships among the features of medical data, an ensemble population adaptive weight Gray Wolf Optimization feature selection algorithm (PAWGWO) is proposed. First, a new adaptive internal learning gray wolf optimization operator (NAILGWO) is proposed to solve the problem that the standard GWO algorithm learns only from the three best search agents and therefore misses excellent solutions near other search agents. In addition, a Hunger Area Information Restart (HAIR) strategy is proposed, in which some gray wolves in the population are randomly selected in each iteration to explore near other excellent search agents, enhancing the local search ability. PAWGWO is evaluated on 8 medical datasets and 24 benchmark datasets from other fields, and compared with 10 other metaheuristic algorithms and 8 non-metaheuristic feature selection algorithms. The experiments show that PAWGWO outperforms existing feature selection algorithms in most cases.
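
The dRFE procedure described in contribution 1 (t-test screening, then repeatedly deleting different numbers of low-coefficient features and keeping the best-performing candidate) can be illustrated with a minimal sketch. This is not the dissertation's implementation. The abstract does not specify the supervised model, the deletion counts, or the evaluation protocol, so the sketch assumes a nearest-centroid linear classifier as a stand-in model, deletion counts of 1, 2, and 4, a screening size of 10, a single train/validation split, and synthetic toy data. All function names (`welch_t`, `fit_centroid`, `accuracy`, `drfe`, `make_data`) are hypothetical.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic between two samples of one feature."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def fit_centroid(X, y, feats):
    """Stand-in linear model (assumption): nearest-centroid weights per feature."""
    w, mid = {}, {}
    for j in feats:
        m0 = mean(row[j] for row, c in zip(X, y) if c == 0)
        m1 = mean(row[j] for row, c in zip(X, y) if c == 1)
        w[j], mid[j] = m1 - m0, (m0 + m1) / 2.0
    return w, mid


def accuracy(X_tr, y_tr, X_va, y_va, feats):
    """Fit on the training split, then score on the validation split."""
    w, mid = fit_centroid(X_tr, y_tr, feats)
    hits = 0
    for row, c in zip(X_va, y_va):
        score = sum(w[j] * (row[j] - mid[j]) for j in feats)
        hits += int((score > 0) == (c == 1))
    return hits / len(y_va)


def drfe(X_tr, y_tr, X_va, y_va, keep=10, steps=(1, 2, 4)):
    """Sketch of dynamic RFE; `keep` and `steps` are assumed values."""
    def t_of(j):
        a = [r[j] for r, c in zip(X_tr, y_tr) if c == 0]
        b = [r[j] for r, c in zip(X_tr, y_tr) if c == 1]
        return welch_t(a, b)

    # Step 1: t-test screening keeps the `keep` most discriminative features.
    current = sorted(range(len(X_tr[0])), key=t_of, reverse=True)[:keep]
    best = (accuracy(X_tr, y_tr, X_va, y_va, current), current)
    while len(current) > 1:
        # Step 2: train the model and rank features by |coefficient|.
        w, _ = fit_centroid(X_tr, y_tr, current)
        ranked = sorted(current, key=lambda j: abs(w[j]))
        # Step 3: try deleting different numbers of the weakest features.
        trials = [ranked[k:] for k in steps if k < len(ranked)]
        scored = [(accuracy(X_tr, y_tr, X_va, y_va, t), t) for t in trials]
        # Step 4: keep the best candidate of this iteration (ties: fewer features).
        acc, current = max(scored, key=lambda s: (s[0], -len(s[1])))
        if acc >= best[0]:
            best = (acc, current)
    return best[1], best[0]


def make_data(n, p, informative, seed):
    """Toy data (assumption): only the first `informative` features carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(2.0 * c if j < informative else 0.0, 1.0) for j in range(p)]
         for c in y]
    return X, y


X_tr, y_tr = make_data(40, 50, 3, seed=0)
X_va, y_va = make_data(40, 50, 3, seed=1)
selected, acc = drfe(X_tr, y_tr, X_va, y_va)
print(sorted(selected), acc)
```

Because the same validation split is used both to choose among candidates and to report accuracy, the reported score is optimistic; a real evaluation would typically use nested cross-validation or a separate held-out test set.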
Keywords/Search Tags: feature selection algorithm, biomarker detection, swarm intelligence algorithm, machine learning, biomedical data