Research On Relevant Problems Of Tumor DNA Microarray Expression Data Analysis | Posted on:2010-08-28 | Degree:Doctor | Type:Dissertation | Country:China | Candidate:G Y Wang | GTID:1114360305473646 | Subject:Control Science and Engineering

Abstract/Summary:

With the development of the Tumor Genomic Project, DNA microarrays are widely used in tumor research. Tumor DNA microarrays provide large volumes of gene expression data for tumor genomic research; these data reflect how gene expression levels fluctuate across the developmental stages and physiological states of different tissue cells. Because it can uncover the nature of tumors at the genomic level and offers a new systematic method, the analysis of tumor gene expression data has attracted great attention. Researchers have so far confirmed some tumor genes and accumulated some knowledge of oncogenesis and of the regulation mechanisms of tumor genes, but these achievements are still too limited to understand and cure tumors. How to analyze tumor gene expression data effectively has therefore become an urgent problem. Taking tumor DNA microarray expression data analysis as its research topic, this dissertation studies the related preprocessing techniques, cluster analysis algorithms, and gene regulatory network modeling methods. The main contents and original contributions of the dissertation are summarized as follows:

(1) Methods for missing value estimation and normalization of gene expression data. For the missing value estimation problem, we found that the similarity between gene expression profiles influences estimation precision, and that the dimensional distribution of the gene expression data without missing values is a useful reference for estimating missing values.
This dissertation therefore presents a new missing value estimation method based on K-Nearest Neighbor and Support Vector Regression (KNN-SVR). The algorithm takes as its training set the genes that have no missing values and are most similar to the genes whose missing values are to be estimated, and builds regression models with SVR to estimate the missing values. The algorithm achieves better accuracy and stability. In the classification and class discovery of tumor gene expression data, current normalization methods are likely to cause samples to be classified incorrectly. This dissertation therefore reworks the normalization methods and uses class information to normalize gene expression data, which makes the data better suited to classification and class discovery.

(2) Methods for gene cluster analysis of tumor time series microarray data. To identify asynchronous or local correlation in expression profiles, this dissertation introduces the concept of the Local Maximum Correlative Coefficient (LMCC) and defines the correlative relationship between genes. It then studies the rules for setting the maximum time delay and the minimum local time segment. Finally, it presents a new clustering method that uses LMCC as the similarity measure in the K-means method, with some corresponding improvements. This method identifies asynchronous or local correlation better, and LMCC provides a more effective similarity measure.

(3) Methods for gene cluster analysis of tumor non-time series microarray data. To eliminate noise and identify genes whose differential expression is not obvious in microarray data, this dissertation presents a model of Constrained Independent Component Analysis (CICA) with decreasing noise (deCICA) and uses this model to cluster tumor non-time series microarray data. The clustering method based on the deCICA model has two parts.
First, the method extracts a Gaussian white noise component to remove noise from the gene expression data; the Ljung-Box Q statistic serves as the constraint on the 'white' character, and Gaussianity maximization is the objective. Second, the method uses the CICA model to cluster the denoised gene expression data; the expression data of target genes serve as the constraint toward the relevant biological processes or functional clusters, and non-Gaussianity maximization is the objective. Because it removes part of the noise while retaining the specific information in the expression data, this method can effectively identify genes whose differential expression is not obvious.

(4) Methods for constructing gene regulatory networks. This dissertation first builds the N-order Dynamic Bayesian Network (N-DBN) to model multiple time delays in gene regulation, and then presents a new method for constructing multi-time-delay gene regulatory networks with N-DBN by combining expression data with multiple independent sources of prior knowledge (N-DBN-MP). To combine the prior knowledge with time series microarray data, the method transforms the multiple independent sources of prior knowledge into different prior probability distributions according to their characteristics, and uses a Markov Chain Monte Carlo (MCMC) algorithm to learn the network structure of the N-DBN. During MCMC learning, the acceptance probability of a network structure is decomposed under the hypothesis that the microarray data are independent of the prior knowledge, which achieves the fusion of microarray data and prior knowledge. N-DBN-MP not only identifies the regulatory relationships between genes effectively but also reduces the effect of noise in microarray data.

Keywords/Search Tags: Tumor, DNA Microarray, Missing Value Estimation, Cluster Analysis, Gene Regulatory Network, LMCC, deCICA, N-DBN
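
The LMCC idea in contribution (2) can be illustrated with a small sketch. The abstract does not give the formal definition, so the code below assumes one plausible reading: LMCC is the largest Pearson correlation found between two expression profiles over every time shift up to a maximum delay and every contiguous local segment of at least a minimum length. The function names `pearson` and `lmcc` and the parameters `max_delay` and `min_segment` are hypothetical and chosen for illustration; this is not the dissertation's implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def lmcc(x, y, max_delay, min_segment):
    """Assumed reading of LMCC: the maximum Pearson correlation over all
    delays in [-max_delay, max_delay] and all contiguous segments of
    length >= min_segment. Returns -1.0 if no segment is long enough."""
    n = len(x)
    best = -1.0
    for delay in range(-max_delay, max_delay + 1):
        # Align x[t] with y[t + delay] and keep only the overlapping part.
        if delay >= 0:
            xs, ys = x[:n - delay], y[delay:]
        else:
            xs, ys = x[-delay:], y[:n + delay]
        m = len(xs)
        # Scan every local segment of the aligned profiles.
        for length in range(min_segment, m + 1):
            for start in range(m - length + 1):
                r = pearson(xs[start:start + length], ys[start:start + length])
                best = max(best, r)
    return best
```

Under this reading, a profile that is a delayed copy of another scores close to 1 once the delay is within `max_delay`, even when the ordinary zero-delay correlation is low. A measure of this kind could then stand in for the distance function in a K-means style clustering, as the abstract describes.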