
Research on Cancer Feature Gene Selection Based on Gene Expression Data

Posted on: 2017-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: X D Li
GTID: 2334330503492765
Subject: Control Science and Engineering
Abstract/Summary:
With the development of microarray technology and the accumulation of large amounts of gene expression data for different cancer types, selecting the key genes for cancer classification from gene expression data has attracted wide attention from researchers. However, gene expression data are high-dimensional with small sample sizes, which poses a serious challenge for cancer classification: it easily leads to over-fitting and the curse of dimensionality, resulting in long computing times and lower classification accuracy. Since not all genes are associated with cancer, irrelevant features seriously degrade learning performance. It is therefore necessary to select cancer-related feature genes from the massive data, which provides an important reference for clinical cancer diagnosis and classification. As a typical way of handling gene expression data, feature selection reduces dimensionality and computational complexity, reduces redundancy, and effectively improves classification accuracy. Most importantly, the selected features have clear biological significance. In this thesis, we therefore study feature selection methods for cancer gene expression data from three perspectives: single-feature ranking, subset scoring, and sparse embedding. The main results are as follows:

(1) First, we adapt the feature extraction method LLRFC into a new filter feature selection method, named LLRFC score, and then improve it by eliminating redundancy among the features; the resulting method is named LLRFC score+. Based on an analysis of the LLRFC feature extraction algorithm, we turn it into a single-feature ranking approach to feature selection, LLRFC score. Because this method does not consider the correlation between features, the selected feature subset contains redundant features. We therefore propose LLRFC score+, which combines the Pearson correlation coefficient with LLRFC score and removes redundancy effectively. Compared with several other feature selection approaches on nine public cancer gene expression datasets, the experimental results show that the proposed method is promising and effective for tumor classification.

(2) Second, we propose a subset-score feature selection method based on SLLE, named SSLLE. The supervised locally linear embedding method (SLLE) effectively preserves the local properties of the data and fully takes the sample label information into account, and it is widely used for classifying high-dimensional data. However, when it is used for feature selection, features are ranked individually, the relationships between features are ignored, and the selected feature subset is not optimal. We therefore propose SSLLE (Subset-score Supervised Locally Linear Embedding), an iterative subset-scoring optimization method built on SLLE within a graph-theoretic framework. Compared with the feature-score method FSLLE on six different gene expression datasets, SSLLE achieves better classification accuracy.

(3) Finally, we propose a feature selection method that joins sparse learning with locally linear embedding, named JLLESR. Sparse learning can be used effectively for feature selection, but it selects features based on the global structure of the gene expression data and ignores the local structure. Embedding learning, in contrast, preserves the local neighborhood relationships between features well. We therefore propose JLLESR, a sparse embedded unsupervised feature selection method that combines the sparse learning model with LLE. The L2,1-norm of the matrix formed by the transformation vectors is added as a penalty term to the objective function (commonly a least squares regression), and features are selected according to their sparse contribution to the regression. Compared with several other feature selection approaches on six public cancer gene expression datasets, the experimental results show that the proposed method is promising and effective for cancer classification. The method does not depend on sample labels and is insensitive to its parameters, and the selected features preserve the local neighborhood relationships of the data well and have a good biological interpretation.
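To make the L2,1-norm penalty in point (3) concrete, the following minimal Python sketch computes the L2,1-norm of a transformation matrix and ranks features by the Euclidean norm of their rows. It is only an illustration under stated assumptions: the abstract does not give the JLLESR objective or its solver, so the matrix `W`, its values, the convention that each row corresponds to one gene, and the function names `l21_norm` and `rank_features_by_row_norm` are all hypothetical.

```python
import math


def l21_norm(W):
    """L2,1-norm: the sum of the Euclidean norms of the rows of W."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def rank_features_by_row_norm(W, k):
    """Return the indices of the k rows of W with the largest Euclidean norm.

    Assumption: row i of W holds the regression coefficients of feature i,
    so a row that the penalty has driven to zero marks a discarded feature.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in W]
    order = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    return order[:k]


# Hypothetical 4-feature, 2-dimensional transformation matrix
# (illustrative values only, not taken from the thesis).
W = [
    [3.0, 4.0],  # row norm 5
    [0.0, 0.0],  # row norm 0: this feature contributes nothing
    [1.0, 0.0],  # row norm 1
    [0.0, 2.0],  # row norm 2
]

print(l21_norm(W))                      # sum of the row norms
print(rank_features_by_row_norm(W, 2))  # indices of the two strongest rows
```

Because the penalty sums row norms rather than squaring every entry, minimizing it tends to zero out entire rows, which is why the surviving rows can be read as the selected features.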
Keywords/Search Tags:gene expression data, feature selection, LLRFC score+, SSLLE, JLLESR
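The redundancy-removal idea behind LLRFC score+ in point (1) of the abstract can be sketched as a greedy filter: walk through the features in order of their relevance score and skip any feature that is strongly Pearson-correlated with one already kept. This is a rough sketch, not the thesis's algorithm. The abstract does not say how LLRFC score is computed or how it is combined with the correlation coefficient, so the precomputed `scores`, the assumption that a higher score is better, the `threshold` of 0.9, and the function names are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / (sx * sy)


def select_non_redundant(features, scores, k, threshold=0.9):
    """Greedily keep up to k features, skipping near-duplicates.

    features: list of feature columns (one list of sample values per gene).
    scores:   one relevance score per feature; higher is assumed better.
    """
    order = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    selected = []
    for j in order:
        if all(abs(pearson(features[j], features[s])) < threshold
               for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return selected


# Hypothetical data: gene 1 is an exact multiple of gene 0, so it is redundant.
features = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]
scores = [0.9, 0.8, 0.5]
print(select_non_redundant(features, scores, k=2))
```

In this toy input the second-ranked gene is dropped because it carries the same information as the top-ranked one, so the lower-scoring but uncorrelated third gene is kept instead.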