The Research On Feature Selection And Cancer Classification Based On Correlation

Posted on: 2013-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: X H Peng
Full Text: PDF
GTID: 2254330425483747
Subject: Computer Science and Technology
Abstract/Summary:
Microarray technology, which produces gene expression data, is a powerful tool for studying gene function and can analyze thousands of genes at the same time. Cancer classification and the identification of key genes associated with cancer have become an important part of cancer research. Because microarray data have high dimensionality and a small sample size, traditional data mining methods do not perform well on them.

This thesis analyzes and summarizes in depth the feature selection methods and classifiers used on microarray data. An improved feature selection method and an improved classifier, both based on correlation-based feature selection (CFS), are proposed to improve classification accuracy and generalization ability in cancer classification. The content of this thesis is summarized as follows:

A new feature selection method based on correlation-based feature selection is proposed. Gene expression datasets are high-dimensional, contain few samples, and are noisy, so the data must first be preprocessed. First, missing values are filled in and the data are normalized. Then, genes with small variance are removed because they contribute little to classification; this reduces both the dimensionality and the time complexity. Finally, the variable-to-variable measures and the measures between each variable and the observed class are calculated. A heuristic search over the variable space is used to select informative gene subsets, and the weight of each subset is computed from these measures. A subset of discriminative genes is then obtained through regression. A stratified sampling strategy is introduced to obtain the most informative genes. Experiments on three gene expression datasets show that the method improves classification accuracy effectively.

The base classifiers are trained on the feature datasets selected by the proposed method, CFS-SS. Because these feature datasets differ from one another, training yields a set of diverse classifiers. Combining them by voting produces a new ensemble classifier. Classification experiments on microarray gene expression data verify the feasibility and reliability of this method.
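The pipeline outlined in the abstract (variance filtering, correlation measures, heuristic subset search, and voting) can be illustrated with a minimal sketch. This is not the thesis's CFS-SS implementation. It assumes absolute Pearson correlation for both measures, the standard CFS merit formula as the subset weight, and greedy forward selection as the heuristic search. The regression and stratified sampling steps are not reproduced, and all function names and parameters (`variance_filter`, `cfs_merit`, `forward_select`, `majority_vote`, `keep_fraction`, `max_features`) are hypothetical.

```python
import numpy as np


def variance_filter(X, keep_fraction=0.5):
    """Return sorted indices of the highest-variance columns (genes).

    keep_fraction is an illustrative parameter, not a value from the thesis.
    """
    variances = X.var(axis=0)
    n_keep = max(1, int(round(keep_fraction * X.shape[1])))
    top = np.argsort(variances)[::-1][:n_keep]
    return np.sort(top)


def cfs_merit(X, y, subset):
    """Standard CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean absolute feature-class correlation and r_ff is the
    mean absolute feature-feature correlation within the subset.
    Assumes no column in the subset is constant (correlation undefined).
    """
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    corr = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    r_ff = (corr.sum() - k) / (k * (k - 1))  # mean of off-diagonal entries
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


def forward_select(X, y, max_features=10):
    """Greedy forward search: add the gene that most increases the merit."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:  # stop when no candidate improves the merit
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


def majority_vote(predictions):
    """Combine base classifier outputs by plurality vote.

    predictions: array-like of shape (n_classifiers, n_samples) holding
    non-negative integer class labels. Ties go to the smallest label.
    """
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])
```

In use, one would apply `variance_filter` first, run `forward_select` on the remaining columns, train each base classifier on its own selected subset, and pass the stacked predictions to `majority_vote`. Greedy search is chosen here only for brevity; the abstract does not say which heuristic search the thesis uses.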
Keywords/Search Tags: cancer classification, correlation-based feature selection, stratified sampling, ensemble classifiers