
Dimensionality Reduction And Clustering Ensemble Of Tumor Gene Expression Profile

Posted on: 2015-06-12
Degree: Master
Type: Thesis
Country: China
Candidate: J S Pan
GTID: 2284330461974801
Subject: Applied Mathematics
Abstract/Summary:
Tumors have affected human health at an alarming rate in recent years. At the same time, steady improvements in gene chip technology have driven explosive growth in tumor gene expression profile data, which gives researchers an opportunity to explore the mysteries of tumors. However, these data pose problems such as high dimensionality, small sample size, high noise, and imbalanced data distribution, all of which make research challenging. How to analyze such data effectively has attracted considerable academic interest. At present, there are two main research topics in tumor gene expression profile analysis: the first is effective algorithms for selecting the core disease-causing genes or latent variables; the second is stable and dependable clustering algorithms for discovering tumor subtypes. This thesis addresses both topics, as briefly described below.

1. First, neighborhood components analysis (NCA) is applied to the classification of tumor gene expression profiles. Because the initial matrix of NCA is otherwise set randomly and blindly, the standardized right singular matrix obtained by singular value decomposition (SVD) is used as the initial value of the NCA algorithm, so that the initial matrix contains as much information as possible. Experimental results show that the improved NCA algorithm, called INCA, effectively extracts classification information and improves the tumor classification recognition rate.

2. Because tumor gene expression profiles are high-dimensional, we propose a new feature gene selection method that combines the Fisher feature selection algorithm with the Pearson correlation coefficient. The Fisher index defined in the traditional Fisher algorithm neglects the difference in variance between samples of different classes when it quantifies the importance of a gene. As a result, genes whose class means are equal but whose class variances differ are removed incorrectly, because their traditional Fisher index is zero. To avoid this problem, we propose an improved algorithm, Vfisher, which introduces a new gene importance index, also called Vfisher. Moreover, without affecting classification accuracy, we use the Pearson coefficient to remove redundant genes that have smaller Vfisher index values, compressing the data further.

3. Gene expression profiles are high-dimensional and have small sample sizes, so many common clustering algorithms fail to produce good clustering results on them. A single clustering algorithm suffers from low accuracy and poor stability, while a cluster ensemble can overcome these defects to some extent. Traditional cluster ensemble algorithms focus on consensus functions and cluster members and ignore the selection of the benchmark used to unify cluster labels. In this thesis, we propose the MSA algorithm, which introduces the Silhouette index and takes the cluster member with the highest Silhouette index value as the benchmark for unifying cluster labels. Compared with existing cluster ensemble algorithms, the MSA algorithm shows certain advantages.
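The zero-index problem described in the second contribution can be illustrated with a small sketch. The abstract does not give the formula of either the traditional Fisher index or the Vfisher index, so the code below assumes the common two-class form (squared difference of class means divided by the sum of class variances) and does not attempt to reproduce Vfisher. The function names, the toy expression values, the placeholder importance scores, and the 0.9 correlation threshold are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def fisher_score(class_a, class_b):
    """Classical two-class Fisher score of one gene (assumed form, not taken from the thesis)."""
    gap = mean(class_a) - mean(class_b)
    spread = pvariance(class_a) + pvariance(class_b)
    return gap * gap / spread if spread > 0 else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm_x = sum((a - mx) ** 2 for a in x) ** 0.5
    norm_y = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y) if norm_x > 0 and norm_y > 0 else 0.0


def drop_redundant(genes, scores, threshold=0.9):
    """Greedy filter (hypothetical): visit genes from highest to lowest score and
    keep a gene only if its |r| with every already kept gene is below the threshold."""
    kept = []
    for name in sorted(genes, key=lambda g: scores[g], reverse=True):
        if all(abs(pearson(genes[name], genes[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: three samples per class, values invented for illustration.
equal_mean_a, equal_mean_b = [4.0, 5.0, 6.0], [1.0, 5.0, 9.0]  # same mean, different variance
shifted_a, shifted_b = [1.0, 2.0, 3.0], [7.0, 8.0, 9.0]        # clearly different means

score_equal_mean = fisher_score(equal_mean_a, equal_mean_b)  # numerator vanishes
score_shifted = fisher_score(shifted_a, shifted_b)

genes = {
    "g1": shifted_a + shifted_b,
    "g2": [2 * v + 1 for v in shifted_a + shifted_b],  # linear copy of g1, fully redundant
    "g3": equal_mean_a + equal_mean_b,
}
scores = {"g1": 3.0, "g2": 2.0, "g3": 1.0}  # placeholder importance scores, not Vfisher values
kept = drop_redundant(genes, scores)
print(score_equal_mean, score_shifted, kept)
```

In this toy run the gene with equal class means receives a score of exactly zero even though its class variances differ strongly, which is the situation the Vfisher index is meant to handle. The redundancy filter drops `g2` because it is perfectly correlated with the higher-scoring `g1`.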
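The benchmark selection idea in the third contribution can be sketched as follows: compute the mean Silhouette index of each cluster member, take the member with the highest value as the benchmark, and relabel the other members against it. The abstract does not say how labels are matched or which consensus function MSA uses, so the greedy overlap matching and the majority vote below are assumptions, as are all function names and the toy data.

```python
from collections import Counter
from math import dist


def silhouette(points, labels):
    """Mean silhouette width of one partition, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # a singleton cluster contributes 0 by convention
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def align_labels(benchmark, member):
    """Rename the labels of `member` to those of `benchmark` by greedy maximal
    overlap (hypothetical rule; assumes both have the same number of clusters)."""
    overlap = Counter(zip(member, benchmark))
    mapping, used = {}, set()
    for (m_lab, b_lab), _ in overlap.most_common():
        if m_lab not in mapping and b_lab not in used:
            mapping[m_lab] = b_lab
            used.add(b_lab)
    return [mapping.get(lab, lab) for lab in member]


# Toy data: six one-dimensional samples forming two obvious groups.
points = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,)]
partitions = [
    [0, 1, 0, 1, 0, 1],  # poor cluster member
    [1, 1, 1, 0, 0, 1],  # good structure, swapped labels, one mistake
    [0, 0, 0, 1, 1, 1],  # clean cluster member
]

# Pick the cluster member with the highest Silhouette index as the benchmark.
benchmark_index = max(range(len(partitions)), key=lambda k: silhouette(points, partitions[k]))
benchmark = partitions[benchmark_index]

# Unify all labels against the benchmark, then combine by majority vote (assumed consensus).
aligned = [align_labels(benchmark, p) for p in partitions]
consensus = [Counter(column).most_common(1)[0][0] for column in zip(*aligned)]
print(benchmark_index, aligned, consensus)
```

Choosing the benchmark by a quality index rather than arbitrarily means that the label names of the most internally consistent member become the reference, so a poor member cannot dictate how the others are relabeled.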
Keywords/Search Tags: tumor gene expression profile, dimensionality reduction, feature selection, clustering ensemble, classification