Research Of Disease Genes Identification Based On Microarray Data

Posted on: 2010-08-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H P Zhang
GTID: 1484303389457314
Subject: Precision instruments and machinery
Abstract/Summary:
Finding genes likely involved in the clinical behavior of human disease is very important for understanding cancer pathogenesis, improving medical diagnosis and locating effective drug targets. Bioinformatics has become an important approach to life science research through the analysis of many types of biological data. With the development of high-throughput techniques, huge amounts of biological data containing human genetic and disease information are being generated. Disease gene identification based on bioinformatics methods poses new challenges for the associated techniques of data mining and knowledge discovery. Microarray technology, which can produce large-scale data at the whole-genome level, provides new opportunities for modern functional genomics. Mining the genes of complex diseases from gene microarray data at the molecular level is a hot topic in disease gene identification.

The purpose of this thesis is to mine disease genes with bioinformatics methods by applying the proposed algorithms to gene expression data; in addition, disease gene prediction software is used to mine obesity-related genes. The main research work and innovative results of the thesis are as follows:

1. An improved singular value decomposition method (LRSVD) is proposed to find genes that are associated with disease. LRSVD decomposes the gene expression data by SVD and evaluates the contribution of each eigengene to classification accuracy by the regression coefficients of logistic regression (LR) instead of by the variance. To evaluate each gene, it is then necessary to transform back to the original data, so that genes with a low contribution to sample classification can be shaved off. The inner product (IP) of each gene is proposed for this evaluation. It is defined as the inner product of the gene's absolute coordinate vector with the absolute regression coefficient vector; a larger IP value indicates that the corresponding gene has higher discriminative power for sample classification. The LRSVD method is applied to gene expression data, and the genes obtained are disease-related genes with high classification accuracy.

2. An improved discrete particle swarm optimization algorithm combined with chaos and a mutation operator (CMDPSO) is proposed for disease gene selection. To overcome the premature convergence of basic PSO, the main idea of CMDPSO is to integrate chaos and the mutation operator of the genetic algorithm (GA) into the basic discrete PSO (DPSO) algorithm. The ergodic property of chaos is used as an optimization mechanism to initialize the particles and to produce new particles during the iterations; the GA mutation operator helps particles escape from local optima. The optimized gene subset is obtained efficiently by applying CMDPSO to gene expression data.

3. The MIClique algorithm, based on mutual information and clique analysis, is proposed to identify subsets of differentially co-expressed genes. Mutual information is used to measure the co-expression relationship between each pair of genes in two different classes of samples, and the microarray data are then transformed into a graph in which each vertex corresponds to a gene and each edge to a relationship between genes. The adjacency matrix of the graph is also obtained. Differentially co-expressed disease genes, which present a similar expression pattern in normal samples but undergo a distinct alteration in disease samples, are represented as a completely connected subgraph, so the problem of identifying them is converted into clique detection on the adjacency matrix. Clique analysis and cohesive subgroups from graph theory are introduced to represent biological modules of similar function. MIClique can detect differentially co-expressed gene cliques in a simple and intuitive way. Both the biological functions of the genes in the differentially co-expressed gene subsets and their common biological pathways are examined.

4. Several commonly used disease gene prediction software tools are introduced. The computational tool ENDEAVOUR is used to assess the probability that the gene GAD2 is a candidate disease gene for obesity. ENDEAVOUR evaluates each test gene by its similarity to the training genes (known disease genes). The computational results indicate that GAD2 ranks at the top of the prioritization, which means that GAD2 has a high probability of being a candidate disease gene for obesity. These results help to explain the contradictory conclusions obtained by other researchers. The obesity-related functions of GAD2 are also discussed by collecting information from biological databases and the literature.

Finally, the thesis summarizes the work and its findings and presents directions and objectives for further research.
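As an illustration of the IP score described in point 1, the following Python sketch uses NumPy. It is not the thesis implementation. The genes-by-samples layout, the choice of the rows of V^T as eigengenes, the use of U·diag(s) as the gene coordinates, the gradient-descent logistic regression with an L2 penalty, and the name `lrsvd_ip_scores` are all assumptions made for this example.

```python
import numpy as np


def lrsvd_ip_scores(X, y, n_iter=2000, step=0.5, l2=0.01):
    """Score genes by an inner product (IP) in the spirit of LRSVD.

    X : (n_genes, n_samples) expression matrix (assumed layout).
    y : length n_samples array of 0/1 class labels.
    Returns (ip, beta): one IP score per gene and the fitted coefficients.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[1]

    # Center each gene, then decompose: Xc = U @ diag(s) @ Vt.
    # Rows of Vt are taken as the eigengenes (patterns across samples).
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s[0]  # drop numerically null components
    U, s, Vt = U[:, keep], s[keep], Vt[keep]

    # Each sample's features are its loadings on the eigengenes,
    # rescaled so that the columns have roughly unit variance.
    Z = Vt.T * np.sqrt(n_samples)

    # Logistic regression by gradient descent. The L2 penalty is an added
    # assumption that keeps coefficients finite when classes are separable.
    beta = np.zeros(Z.shape[1])
    b0 = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta + b0)))
        beta -= step * (Z.T @ (p - y) / n_samples + l2 * beta)
        b0 -= step * np.mean(p - y)

    # Gene coordinates in eigengene space, then IP = |coords| . |beta|.
    coords = U * s
    ip = np.abs(coords) @ np.abs(beta)
    return ip, beta
```

Genes would then be ranked by `ip`, and those with the smallest scores shaved off before classification.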
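The search loop outlined in point 2 can be sketched in the same hedged way, using only the Python standard library. The abstract does not specify the chaotic map, the transfer function or the parameter values, so the logistic map at r = 4, the sigmoid transfer, the per-bit mutation rate and the rule of replacing the worst particle with a chaotic one are assumptions. The function name `cmdpso` and the caller-supplied `fitness` are hypothetical.

```python
import math
import random


def cmdpso(fitness, n_genes, n_particles=20, n_iter=50,
           w=0.7, c1=1.5, c2=1.5, p_mut=0.02, seed=0):
    """Maximize fitness(mask) over 0/1 gene masks of length n_genes."""
    rng = random.Random(seed)
    z = 0.3141  # state of the chaotic sequence (arbitrary start value)

    def chaotic_bit():
        # Logistic map at r = 4, thresholded at 0.5 to give one bit.
        nonlocal z
        z = 4.0 * z * (1.0 - z)
        if not 1e-12 < z < 1.0 - 1e-12:  # guard against numerical collapse
            z = 0.3141
        return 1 if z > 0.5 else 0

    def chaotic_particle():
        return [chaotic_bit() for _ in range(n_genes)]

    # Chaotic initialization of the swarm.
    pos = [chaotic_particle() for _ in range(n_particles)]
    vel = [[0.0] * n_genes for _ in range(n_particles)]
    fit = [fitness(p) for p in pos]
    pbest = [p[:] for p in pos]
    pbest_fit = fit[:]
    g = max(range(n_particles), key=lambda i: fit[i])
    gbest, gbest_fit = pos[g][:], fit[g]

    def record(i):
        # Evaluate particle i and update personal and global bests.
        nonlocal gbest, gbest_fit
        fit[i] = fitness(pos[i])
        if fit[i] > pbest_fit[i]:
            pbest[i], pbest_fit[i] = pos[i][:], fit[i]
        if fit[i] > gbest_fit:
            gbest, gbest_fit = pos[i][:], fit[i]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_genes):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))
                # Sigmoid transfer: velocity sets the probability of a 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
                # GA-style bit-flip mutation.
                if rng.random() < p_mut:
                    pos[i][d] = 1 - pos[i][d]
            record(i)
        # Replace the currently worst particle with a fresh chaotic one.
        worst = min(range(n_particles), key=lambda i: fit[i])
        pos[worst] = chaotic_particle()
        vel[worst] = [0.0] * n_genes
        record(worst)

    return gbest, gbest_fit
```

In a gene selection setting, `fitness` would typically score a mask by the classification accuracy obtained with the selected genes, possibly penalized by subset size; that choice is left to the caller here.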
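The pipeline of point 3 (mutual information, graph construction, clique detection) can likewise be sketched with the standard library alone. The equal-width discretization, the two thresholds `hi` and `lo`, the use of adjacency sets instead of an explicit adjacency matrix, and the plain Bron-Kerbosch enumeration are assumptions made for this example; the abstract does not state how MIClique performs these steps, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, n_bins=3):
    """Equal-width binning of one gene's expression values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(a, b):
    """Mutual information (in bits) of two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def differential_adjacency(normal, disease, hi=0.8, lo=0.3, n_bins=3):
    """Connect two genes if they are co-expressed in normal samples
    (MI >= hi) but not in disease samples (MI <= lo).

    normal, disease : dict mapping gene name -> list of expression values.
    """
    genes = sorted(normal)
    dn = {g: discretize(normal[g], n_bins) for g in genes}
    dd = {g: discretize(disease[g], n_bins) for g in genes}
    adj = {g: set() for g in genes}
    for g, h in combinations(genes, 2):
        if (mutual_information(dn[g], dn[h]) >= hi
                and mutual_information(dd[g], dd[h]) <= lo):
            adj[g].add(h)
            adj[h].add(g)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```

Isolated genes come back as single-vertex cliques, so a caller would usually keep only cliques above some minimum size, for example `[c for c in maximal_cliques(adj) if len(c) >= 3]`.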
Keywords/Search Tags: gene expression data, disease gene, singular value decomposition, logistic regression, discrete particle swarm optimization, differentially co-expressed genes, disease gene prediction software