| With the development of biostatistics and artificial intelligence, using microarray technology to detect and evaluate cancer has greatly helped to improve patient cure rates. However, when gene microarray data are used to detect cancer, high dimensionality and class imbalance are two major challenges. In view of this, the research in this thesis focuses on feature selection and class balancing, and carries out experiments on four open-source cancer microarray datasets. The specific content is as follows. Firstly, for feature selection, in order to screen out cancer-related genes accurately, this thesis proposes a combinational iterative deletion Relief algorithm based on the traditional Relief algorithm. The proposed algorithm first performs multiple rounds of the Relief algorithm and then removes redundant features by calculating the correlation coefficient with the Kth nearest neighbor. Experimental results show that, compared with the Relief algorithm, the combinational iterative deletion Relief algorithm obtains better classification results with a smaller feature subset. Secondly, for undersampling, in order to avoid the drawback of traditional random undersampling, which removes samples at random and causes serious loss of dataset information, this thesis proposes an undersampling method based on Kmeans-FFT. This method combines the Kmeans clustering algorithm with the FFT to obtain the frequency-amplitude relationship of each sample. Then, by judging the similarity of the frequency-amplitude information of the samples in each class, samples with high similarity are eliminated. Experimental results show that the undersampling method based on Kmeans-FFT achieves better classification accuracy than other undersampling algorithms. Thirdly, for oversampling, because the SMOTE algorithm cannot finely control the number of synthesized samples and does not discriminate among minority samples, this thesis improves the classic SMOTE algorithm. By introducing distance and density functions, more new samples are synthesized around minority class samples that are closer to the majority class samples and lie in sparse areas. Experiments show that treating the samples differently improves classification accuracy. 25 figures, 13 tables, 69 references. |
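
The oversampling idea summarized above can be illustrated with a minimal sketch. The abstract does not give the actual distance and density functions, so everything here is an assumption made for illustration: closeness is taken as the inverse distance to the nearest majority sample, sparsity as the mean distance to the k nearest minority neighbors, and the two are combined with equal weights. The function names (`allocation_weights`, `allocate_counts`, `weighted_smote`) are hypothetical and are not the thesis's implementation.

```python
import math
import random


def _dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def allocation_weights(minority, majority, k=2):
    """Weight each minority sample by closeness to the majority class and by
    local sparsity. Both scores are illustrative assumptions.
    Requires at least two distinct minority samples."""
    eps = 1e-12
    closeness, sparsity, neighbours = [], [], []
    for i, x in enumerate(minority):
        # Assumed distance function: inverse distance to the nearest majority sample.
        closeness.append(1.0 / (min(_dist(x, m) for m in majority) + eps))
        # Indices of the k nearest minority neighbours of x (excluding x itself).
        order = sorted((j for j in range(len(minority)) if j != i),
                       key=lambda j: _dist(x, minority[j]))[:k]
        neighbours.append(order)
        # Assumed density function: a larger mean neighbour distance means a sparser area.
        sparsity.append(sum(_dist(x, minority[j]) for j in order) / len(order))
    c_sum, s_sum = sum(closeness), sum(sparsity) + eps
    weights = [0.5 * c / c_sum + 0.5 * s / s_sum
               for c, s in zip(closeness, sparsity)]
    return weights, neighbours


def allocate_counts(weights, n_new):
    """Split exactly n_new synthetic samples across minority samples
    in proportion to their weights (largest-remainder rounding)."""
    quotas = [w * n_new for w in weights]
    counts = [int(q) for q in quotas]
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in by_remainder[:max(0, n_new - sum(counts))]:
        counts[i] += 1
    return counts


def weighted_smote(minority, majority, n_new, k=2, seed=0):
    """SMOTE-style interpolation with a per-sample synthesis budget."""
    rng = random.Random(seed)
    weights, neighbours = allocation_weights(minority, majority, k)
    counts = allocate_counts(weights, n_new)
    synthetic = []
    for i, n_i in enumerate(counts):
        for _ in range(n_i):
            nb = minority[rng.choice(neighbours[i])]
            u = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(a + u * (b - a)
                                   for a, b in zip(minority[i], nb)))
    return synthetic, counts
```

Calling `weighted_smote(minority, majority, n_new=10)` with lists of coordinate tuples returns exactly ten synthetic points together with the per-sample budget, so the total number of new samples is controlled directly, and minority samples that are near the class boundary and in sparse regions receive a larger share.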