Comparative Study Of Statistical Methods For Microarray Data Analysis

Posted on: 2010-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: L F Dan
Full Text: PDF
GTID: 2144360275981054
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Introduction
Gene microarray technology can measure the expression levels of thousands of genes in a single experiment, making it a powerful tool for addressing many important questions in molecular biology and medicine. Its main application is to find genes that are significantly differentially expressed between samples, to classify the samples on the basis of those genes, and to obtain good classification results with as few genes as possible, which is of great significance for clinical diagnosis, treatment, and functional genomics research.
The lack of good data analysis tools is the main obstacle to the development of microarray technology. Microarray data analysis is difficult for several reasons. First, the number of samples is small relative to the number of genes, which lowers both sensitivity and specificity. Second, gene expression data are usually analyzed with traditional statistical methods that do not account for the non-linear nature of the actual data. In short, gene expression data pose four major challenges: a large amount of data, high dimensionality, small sample size, and non-linear characteristics. The generalized likelihood ratio test (GLRT) is suitable for data with many variables, low expression, and non-linear characteristics. Its test statistic, -2 ln λ, approximately follows a χ²(1) distribution, so the inherent error is effectively kept under control. The support vector machine (SVM) approach is widely used in pattern recognition, nonlinear modeling, and related areas, and it successfully handles data with small sample sizes, non-linearity, high dimensionality, and local extrema.
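As an illustration of the test statistic described above, the following minimal sketch computes -2 ln λ for a single gene. It rests on a model the abstract does not specify: expression values in each class are assumed normal with a common variance, and the null hypothesis is that the two class means are equal. The function name and any input values are hypothetical.

```python
import math

def glrt_two_group(a, b):
    """Return (-2 ln lambda, p-value) for H0: equal means in groups a and b.

    Assumed model (not stated in the abstract): normal data, common variance.
    """
    a, b = list(a), list(b)
    n = len(a) + len(b)
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    mean_all = (sum(a) + sum(b)) / n
    # Residual sum of squares under H0 (one shared mean).
    rss0 = sum((x - mean_all) ** 2 for x in a + b)
    # Residual sum of squares under H1 (one mean per group).
    rss1 = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    # For this model lambda = (rss1 / rss0) ** (n / 2), hence the form below.
    # max() guards against a tiny negative value caused by rounding.
    stat = max(0.0, n * math.log(rss0 / rss1))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Applied gene by gene, a statistic like this could be used to rank genes by evidence of differential expression; whether the thesis used this exact model is not stated.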
In this study, the generalized likelihood ratio test is combined with the support vector machine method to extract differentially expressed genes, classify samples on that basis, and then optimize the classification.
Materials and Methods
This study used the data set published by Golub et al. in 1999. Expression data for 7129 genes from patients with acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) were obtained with high-density oligonucleotide arrays. The training set included 38 samples (27 ALL and 11 AML), and the test set included 34 samples (20 ALL and 14 AML).
Differentially expressed genes in the training set were identified with the generalized likelihood ratio test, and the validity of the results was then assessed against biological knowledge. Support vector machine models with three kinds of kernel function, a neural network, and the Golub domain analysis model were built on the basis of these differentially expressed genes. The inputs and outputs of the training and test sets were normalized using Matlab 7.0. Prediction results were evaluated by the percentage of correct classifications in order to select a good model for the classification task.
Results
The generalized likelihood ratio test identified 50 genes with significant differences. Most of them were confirmed to be associated with leukemia on the basis of biological knowledge, and only a few lacked relevant reports. On the training set, classification accuracy was 100%, 100%, 89.5%, 94.7%, and 94.7% for the polynomial SVM, radial basis SVM, sigmoid SVM, neural network, and Golub domain analysis model, respectively; on the test set, the corresponding rates were 94.1%, 97.1%, 88.2%, 88.2%, and 85.3%. With the radial basis SVM restricted to the first 40, 30, 20, 15, 10, and 8 genes, training/test accuracy was 100%/94.1%, 97.4%/91.2%, 97.4%/94.1%, 100%/94.1%, 97.4%/85.3%, and 92.1%/85.3%, respectively.
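Because the training set has 38 samples and the test set has 34, each reported percentage corresponds to a whole number of correctly classified samples. The sketch below recovers those counts from the model-comparison figures above; the dictionary layout and the function name are illustrative only.

```python
TRAIN_SIZE, TEST_SIZE = 38, 34

# Reported accuracies (%) as (training set, test set) for each model.
reported = {
    "polynomial SVM": (100.0, 94.1),
    "radial basis SVM": (100.0, 97.1),
    "sigmoid SVM": (89.5, 88.2),
    "neural network": (94.7, 88.2),
    "Golub domain analysis": (94.7, 85.3),
}

def correct_count(percent, total):
    """Whole number of correct samples that rounds to the reported percentage."""
    count = round(percent / 100.0 * total)
    # The recovered count must reproduce the reported figure to one decimal place.
    assert abs(100.0 * count / total - percent) < 0.06
    return count

counts = {
    model: (correct_count(train, TRAIN_SIZE), correct_count(test, TEST_SIZE))
    for model, (train, test) in reported.items()
}

for model, (train_ok, test_ok) in counts.items():
    print(f"{model}: {train_ok}/{TRAIN_SIZE} training, {test_ok}/{TEST_SIZE} test")
```

Read this way, the radial basis SVM misclassified 1 of the 34 test samples, while the Golub domain analysis model misclassified 5.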
Conclusions
The generalized likelihood ratio test, which is sensitive for data with many variables, low expression, and non-linear characteristics, was selected to identify differentially expressed genes in this study. Compared with current research on leukemia molecular markers, the identified genes show a significant relationship with the different types of leukemia. Only a few genes have little relevant literature, and these may provide new molecular markers for distinguishing AML from ALL.
The support vector machine is suited to data with small sample sizes, non-linearity, high dimensionality, and local extrema, and it is widely used in pattern recognition and nonlinear modeling. In this work, the first two non-linear kernel functions gave essentially the same results, so SVMs with different non-linear kernel functions (except the sigmoid function) perform roughly the same. The RBF SVM, which gave the best classification results, was ultimately selected. The optimization results show that classification is best when the first 15 selected genes are used.
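The gene-count choice can be illustrated with the radial basis SVM accuracies reported in the Results. This sketch assumes the twelve figures given there are (training, test) pairs for 40, 30, 20, 15, 10, and 8 genes in that order, a reading that is consistent with the 38- and 34-sample set sizes but is not stated explicitly. The selection rule is also hypothetical: highest test accuracy first, then highest training accuracy, then fewest genes.

```python
# (training %, test %) per number of top-ranked genes; the pairing is assumed.
accuracy_by_gene_count = {
    40: (100.0, 94.1),
    30: (97.4, 91.2),
    20: (97.4, 94.1),
    15: (100.0, 94.1),
    10: (97.4, 85.3),
    8: (92.1, 85.3),
}

def pick_gene_count(results):
    """Prefer higher test accuracy, then higher training accuracy, then fewer genes."""
    return max(results, key=lambda k: (results[k][1], results[k][0], -k))

best = pick_gene_count(accuracy_by_gene_count)
print(best)
```

Under these assumptions the rule selects 15 genes: the 40-gene and 15-gene subsets tie on both accuracies, and the smaller subset is preferred.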
Keywords/Search Tags: Leukemia, DNA microarray, Generalized likelihood ratio test (GLRT), Support vector machine (SVM)