Research Of Parallel Clustering And Ensemble Classification For Gene Expression Data

Posted on: 2017-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: R Li
Full Text: PDF
GTID: 2310330488459726
Subject: Computer application technology
Abstract/Summary:
Bioinformatics research is in an era of data explosion. In recent years, technical progress in genomics, metabonomics, transcriptomics and proteomics has given biologists more data with which to analyze organisms from various aspects. Abnormal gene expression often signals abnormal biological activity, and changes in expression values can be captured in gene expression data through microarray technology. Analysis of gene expression data can be used for disease diagnosis in humans and animals, and for studying abnormal phenomena during plant growth. In recent years, combining different types of biological data has become a trend in bioinformatics. This approach, known as biological data integration, can help researchers find potential relations between data and better understand the nature of some biological phenomena.

Clustering is an effective tool for dimension reduction on gene expression data. By clustering tens of thousands of genes, the number of genes in each cluster drops to hundreds or even dozens. In this thesis, biological knowledge is integrated into the clustering process in order to improve the biological interpretability of the results. Meanwhile, exploiting the great diversity between gene subsets, a classification model is constructed using ensemble learning for the gene expression data classification problem.

The Gene Ontology database provides a large amount of gene functional information. Because the gene clusters by themselves are biologically uninformative, Gene Ontology knowledge is used to calculate the biological function similarities between genes, which are then combined with the gene expression data. Affinity propagation is applied to the integrated data, yielding gene subsets with higher biological significance. Based on the clustering result, a neighborhood rough set is used to select representative genes for each cluster, and a more robust ensemble classifier is then built. Experimental results on plant stress response datasets show that incorporating Gene Ontology knowledge is effective.

A simple gene preselection step may discard some genes with potential classification value. In this thesis, parallel computing is therefore used to implement a parallelized affinity propagation algorithm that clusters the original genes directly. Because more gene subsets may be produced by this clustering, a random hill-climbing search algorithm is applied for classifier selection, choosing an appropriate group of classifiers for ensemble classification. Experimental results on plant stress response datasets show that the proposed method can select genes with stronger classification ability.
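The classifier selection step can be illustrated with a minimal sketch of random hill-climbing over subsets of base classifiers. The abstract does not state how candidate subsets are scored or which moves the search makes, so everything below is an assumption made for illustration: binary 0/1 labels, majority voting as the ensemble rule, accuracy on held-out predictions as the objective, and a move that toggles one randomly chosen classifier. The function names `majority_vote_accuracy` and `random_hill_climb` and the toy data are hypothetical and are not taken from the thesis.

```python
import random


def majority_vote_accuracy(selected, predictions, labels):
    """Accuracy of a majority vote over the selected classifiers (binary 0/1 labels)."""
    if not selected:
        return 0.0
    correct = 0
    for i, truth in enumerate(labels):
        votes_for_one = sum(predictions[c][i] for c in selected)
        # Ties go to class 0; this is an arbitrary choice for the sketch.
        voted = 1 if 2 * votes_for_one > len(selected) else 0
        correct += (voted == truth)
    return correct / len(labels)


def random_hill_climb(predictions, labels, n_iter=200, seed=0):
    """Toggle one random classifier per step; keep the move if accuracy does not drop."""
    rng = random.Random(seed)
    n = len(predictions)
    # Start from a random subset, falling back to a single classifier if it is empty.
    current = {c for c in range(n) if rng.random() < 0.5} or {rng.randrange(n)}
    best_score = majority_vote_accuracy(current, predictions, labels)
    for _ in range(n_iter):
        candidate = current ^ {rng.randrange(n)}  # add or remove one classifier
        if not candidate:
            continue  # never evaluate an empty ensemble
        score = majority_vote_accuracy(candidate, predictions, labels)
        if score >= best_score:
            current, best_score = candidate, score
    return sorted(current), best_score


# Toy held-out predictions: one row per base classifier (for example, one per gene cluster).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]

subset, score = random_hill_climb(predictions, labels)
print("selected classifiers:", subset, "accuracy:", score)
```

Accepting moves that leave the accuracy unchanged lets the search drift across plateaus, which are common when many classifiers vote identically. A strict improvement rule would be the more conservative alternative.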
Keywords/Search Tags: Knowledge Integration, Ensemble Learning, Gene Expression Data, Parallel Computing