Because cancer has very high incidence and mortality rates, its diagnosis and treatment are a major focus of modern medicine. Accurate diagnosis helps to explore the relationship between genes and cancer and plays a positive role in cancer prevention. However, the internal structure of cancer gene expression data sets makes diagnosis challenging: the number of features far exceeds the number of samples, and the data contain noise, redundancy, and other interfering information. Feature selection before the classification of gene expression data is therefore increasingly important. In this paper, the Breast, Leukemia-4c, and Lung gene expression data sets are studied, and their features are selected and classified. The specific work of this paper is as follows.

In the feature selection part, to address the small sample size and high dimensionality of biomedical data, this paper proposes a two-stage feature selection framework that combines wrapper, embedded, and filter methods so as to avoid the curse of dimensionality. In the first stage, the framework applies weighted gene co-expression network analysis (WGCNA), random forests, and maximum relevance minimum redundancy (mRMR), and combines the results of the three methods. In the second stage, a new binary gene selection method based on an improved Salp Swarm Algorithm is proposed. This method is combined with six common machine learning classifiers (LightGBM, RF, SVM, XGBoost, MLP, and KNN) to select a feature subset suited to each classifier, and it is compared with five other intelligent optimization algorithms in terms of convergence, number of selected features, and accuracy.

In the classification part, the features selected for different classification methods deviate from one another, because different combinations of features may achieve similar classification results. Therefore, all the features selected by the improved Salp Swarm Algorithm are collected, those with high recurrence frequency are extracted, and these features are fed to an auto-encoder, which further reduces the feature dimensionality while improving the classification accuracy. To improve accuracy further, this paper predicts not only the class label but also the class probability of each sample, constructs weights following the idea of the attention mechanism, and uses them to build a weighted ensemble of the six heterogeneous classification methods.

In summary, this paper proposes a two-stage feature selection framework with gene expression data as the research object. The results show that the proposed method can solve the feature selection problem for high-dimensional data; the framework is not restricted to a particular data set and can be applied to other fields that involve feature selection. A weight construction method based on the idea of the attention mechanism is also proposed for integrating multiple models, and the results show that the accuracy of the integrated classifier is significantly higher than that of any single classification method.
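The abstract does not state how the attention-style weights are computed, so the following is only a minimal sketch under assumptions: each classifier receives a quality score (here a hypothetical validation accuracy), the scores are passed through a softmax to obtain weights that sum to one, and the predicted class probabilities are averaged with those weights. The function names, the `temperature` parameter, and the example numbers are illustrative and are not taken from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ensemble(probas, scores, temperature=0.1):
    """Combine per-model class probabilities for a single sample.

    probas: one probability vector per model, all in the same class order.
    scores: one quality score per model (assumed: validation accuracy).
    Returns the predicted class index, the combined probabilities,
    and the model weights.
    """
    weights = softmax(scores, temperature)
    n_classes = len(probas[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, probas))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: combined[c])
    return label, combined, weights


# Hypothetical example: three classifiers, two classes.
probas = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
scores = [0.70, 0.90, 0.85]
label, combined, weights = weighted_ensemble(probas, scores)
```

A smaller `temperature` concentrates the weight on the best-scoring classifier, while a larger one moves the ensemble toward a plain average; which setting is appropriate would have to be decided on held-out data.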