Study Of Feature Selection And Class Balancing Methods In Cancer Microarray Data

Posted on:2023-01-28

Degree:Master

Type:Thesis

Country:China

Candidate:M Wang

Full Text:PDF

GTID:2544307070473664

Subject:Statistics

Abstract/Summary:

With the development of biostatistics and artificial intelligence, the use of microarray technology to detect and evaluate cancer has greatly helped to improve patient cure rates. However, when gene microarray data are used to detect cancer, high dimensionality and class imbalance are two major challenges. In view of this, the work in this thesis focuses on feature selection and class balancing, and carries out experiments on four open-source cancer microarray datasets. The specific content is as follows.

First, for feature selection, in order to screen out cancer-related genes accurately, this thesis proposes a combinational iterative deletion Relief algorithm based on the traditional Relief algorithm. The algorithm first performs multiple rounds of the Relief algorithm and then removes redundant features by calculating the correlation coefficient with the Kth nearest neighbor. Experimental results show that, compared with the Relief algorithm, the proposed combinational Relief algorithm obtains better classification results with smaller feature subsets.

Second, for undersampling, traditional random undersampling removes samples at random and causes serious loss of dataset information. To avoid this, the thesis proposes an undersampling method based on Kmeans-FFT. The method combines the Kmeans clustering algorithm with the fast Fourier transform (FFT) to obtain the frequency-amplitude relationship of each sample. Then, by judging the similarity of the frequency-amplitude information of the samples in each class, it eliminates samples with high similarity. Experimental results show that the Kmeans-FFT undersampling method achieves better classification accuracy than the other undersampling algorithms compared.

Third, for oversampling, the SMOTE algorithm cannot finely control the number of synthesized samples and does not discriminate among minority samples when selecting them, so the thesis improves the classic SMOTE algorithm. By introducing distance and density functions, more new samples are synthesized around minority class samples that are closer to the majority class samples and lie in sparse areas. Experiments show that treating the samples differently in this way can improve classification accuracy.

25 figures, 13 tables, 69 references.
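The feature selection idea in the abstract (several rounds of Relief, followed by removal of redundant features) can be illustrated with a minimal Python sketch. The abstract gives no implementation details, so everything here is an assumption rather than the thesis's actual method: the function names `relief_weights`, `iterative_relief` and `drop_redundant`, the parameters `rounds`, `keep_frac` and `max_corr`, and the two-class setting are all hypothetical. The redundancy step is also a substitute: it uses a plain pairwise Pearson correlation threshold instead of the Kth-nearest-neighbor correlation criterion the abstract mentions.

```python
import numpy as np

def relief_weights(X, y):
    """Two-class Relief, deterministic variant that visits every sample once.
    Assumes each class contains at least two samples."""
    n, d = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    Xs = (X - lo) / span
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # L1 distance to every sample
        dist[i] = np.inf                        # never pick the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        # Reward features that separate the miss, penalise those that separate the hit.
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n

def iterative_relief(X, y, rounds=3, keep_frac=0.5):
    """Assumed iteration scheme: rerun Relief on the surviving features and
    keep the top `keep_frac` fraction each round."""
    idx = np.arange(X.shape[1])
    for _ in range(rounds):
        w = relief_weights(X[:, idx], y)
        n_keep = max(1, int(np.ceil(len(idx) * keep_frac)))
        order = np.argsort(w)[::-1][:n_keep]
        idx = idx[order]
    return idx  # original column indices, best first

def drop_redundant(X, ranked_idx, max_corr=0.9):
    """Assumed redundancy filter: walk the ranking and skip any feature whose
    absolute Pearson correlation with an already kept feature reaches max_corr."""
    kept = []
    for j in ranked_idx:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < max_corr for k in kept):
            kept.append(int(j))
    return kept
```

A typical call would be `drop_redundant(X, iterative_relief(X, y))`, where `X` is a samples-by-genes array and `y` holds the two class labels. Pruning gradually over several rounds, rather than ranking once, lets the nearest-hit and nearest-miss search run in a progressively less noisy feature space, which is one plausible motivation for an iterative scheme on high-dimensional microarray data.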

Keywords/Search Tags:

Cancer microarray data, Relief, Combinational feature selection, Kmeans-FFT, DD-SMOTE

Related items

1	Research On Cancer Feature Gene Selection Based On Microarray Data
2	A Research On Feature Selection Of Tumor Markers Based On Microarray Data
3	Research On EEG Depression Identification Based On Feature Selection And Ensemble Classification
4	Biomolecular feature selection of colorectal cancer microarray data using GA-SVM hybrid and noise perturbation to address overfitting
5	Research Of Feature Selection For Tumor Gene Expression Data
6	An Improved Filter Feature Selection Method And Its Application On The Identification Of Tumor Markers
7	Study On Disease Diagnosis Based On Relief Feature Selection And Mixed Kernel SVM
8	Research On Tumor Feature Gene Selection Method Based On DNA Microarray Data
9	Research On The Cancer Microarray Data Feature Selection Method Based On Krill Herd Algorithm
10	Fundamental Theory And Application Study On Large For Gestational Age Infants Using Machine Learning Techniques