
Tumor Subtype Multi-class Classification And Analysis Based On Gene Expression

Posted on: 2009-04-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W L Xu
GTID: 1114360242495851
Subject: Biomedical engineering
Abstract/Summary:
Tumors are among the most difficult problems in human health. Different tumor subtypes have different features. At the gene expression level, they are governed by the expression of one or many genes and by the interactions among those genes, yet the subtypes are hard to distinguish through clinical analysis alone. DNA microarray technology gives biologists a powerful tool for associating phenotypes with molecules. It is commonly used to compare gene expression levels under different conditions, such as normal versus cancerous tissue. Because the expression levels of thousands of genes can be measured simultaneously in a single experiment, the problem becomes how to interpret these data. Tumor multi-class classification and analysis based on gene expression aims to classify different tumor types or subtypes at the gene level and to analyze tumor-related genes. In tumor research, these discriminatory genes help to classify tumor types, lead to a better understanding of the genetic signatures of cancers, and can improve treatment strategies. Because gene expression data are noisy, high-dimensional, and highly correlated, using them to classify tumor types still poses many hard problems.
For example, the number of genes (the dimension) is usually much larger than the number of samples, and the expression of irrelevant genes weakens classifier performance and increases the computational cost of machine learning. The original research contributions of this dissertation are as follows:

1. Gene selection
Gene selection detects the genes that are most significantly differentially expressed under different conditions in expression data. The current challenge in gene selection is comparing a large number of genes using a limited number of patient samples, so it is not a trivial task for simple statistical analysis. Filter methods used in gene selection studies adopt various statistical measures, and the ability of these measures to discriminate phenotypes is crucial for classification and selection. Here we describe the Standard Deviation Error Distribution (SDED) method for gene selection. It uses the within-class and among-class variation in gene expression data. We tested the method on 4 publicly available leukemia datasets and compared it with the GS2 and CHO methods. The prediction accuracies of SDED are better than those of both GS2 and CHO on the different datasets: 0.8-4.% and 1.6-8.4% higher than GS2 and CHO, respectively. Analyses of the related OMIM annotations and KEGG pathways verified that SDED picks out 4.0% and 6.1% more genes with biological significance than GS2 and CHO, respectively.

2. Tumor subtype multi-class classification
Pattern classification techniques such as Support Vector Machine, Artificial Neural Network, and Decision Tree have achieved great success in classifying diseases and their subtypes. However, multi-class classification still has many problems: accuracy and efficiency are not high enough, and the selected genes are not biologically meaningful. There are two main categories of multi-class classification methods. One decomposes a multi-class problem into binary ones.
The main disadvantage of this type is that the number of classifiers grows rapidly as the number of types (classes) increases, and the computational cost grows quadratically or even exponentially. The other category consists of binary classification algorithms that extend naturally to handle multi-class problems directly, such as discriminant analysis. Discriminant analysis assumes one specific distribution, the normal distribution. Such techniques are more powerful when their assumptions hold, but they mainly handle only linear relationships between the independent and dependent variables. In this dissertation, we developed a novel multi-class classification method, the Simple Gaussian Mixture Model (SGMM), which combines the advantages of discriminant analysis and the Gaussian Mixture Model. Unlike binary classification, this method preserves more information and is useful for the diagnosis and treatment of multi-class tumor subtypes. Four datasets were collected and used to evaluate prediction performance. Under Leave-One-Out Cross Validation, the classification accuracies are all about 2% higher than those of the K-Nearest Neighbor classifier and comparable to those of the Support Vector Machine. The results demonstrate that the method is simple and efficient, with lower computational cost. It is a useful tool for multi-class tumor classification.
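
The abstract does not give the details of SGMM, so the following is only an illustrative sketch of the general idea it builds on: fitting one Gaussian per class directly (no binary decomposition) and scoring the classifier with Leave-One-Out Cross Validation. It is not the author's SGMM. The function names (fit_gaussians, predict, loocv_accuracy), the diagonal-covariance assumption, the var_floor parameter, and the toy expression values are all assumptions introduced here for illustration.

```python
import math
from collections import defaultdict


def fit_gaussians(samples, labels, var_floor=1e-3):
    """Fit one diagonal Gaussian per class (assumption: genes treated as independent)."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per gene
        means = [sum(col) / n for col in columns]
        # var_floor (hypothetical parameter) keeps variances away from zero
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        log_prior = math.log(n / len(samples))
        model[y] = (log_prior, means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log prior plus Gaussian log-likelihood."""
    def score(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * (math.log(2 * math.pi * s2) + (v - m) ** 2 / s2)
            for v, m, s2 in zip(x, means, variances)
        )
    return max(model, key=lambda y: score(model[y]))


def loocv_accuracy(samples, labels):
    """Leave-One-Out Cross Validation: hold out each sample once, train on the rest."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_gaussians(train_x, train_y)
        hits += predict(model, samples[i]) == labels[i]
    return hits / len(samples)


# Toy data invented for illustration: 9 samples, 2 "genes", 3 subtypes.
samples = [
    [1.0, 5.0], [1.2, 5.1], [0.9, 4.8],
    [5.0, 1.0], [5.2, 1.1], [4.9, 0.8],
    [9.0, 9.0], [9.1, 8.8], [8.9, 9.2],
]
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

print(loocv_accuracy(samples, labels))
```

All classes are handled by a single model here, so the number of fitted components grows linearly with the number of subtypes rather than with the number of class pairs. With real microarray data, gene selection would need to be repeated inside each cross-validation fold to avoid an optimistic accuracy estimate.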
Keywords/Search Tags:Bioinformatics, Tumor, Gene Expression, Gene Selection, Multi-class Classification, Standard Deviation Error Distribution, Simple Gaussian Mixture Model, Support Vector Machine, K-Nearest Neighbor