
Application of Support Vector Machines in Microarray Gene Expression Data Classification

Posted on: 2005-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: C Wu
Full Text: PDF
GTID: 2144360125468465
Subject: Epidemiology and Health Statistics
Abstract/Summary:
The widely used gene chip (microarray) technology has brought an explosive increase in microarray experiment data, also called gene expression data. Gene expression data are typically large in scale, with unbalanced numbers of observations and samples in the data matrix. Datasets from different sources also contain many missing values. Most traditional statistical methods cannot handle such datasets, so researchers must seek new methods. In the early days, gene expression data were usually analyzed with clustering algorithms, which produced some credible results. As gene classes became known, researchers needed more effective algorithms that use this information to predict the functions of unknown genes. Supervised algorithms based on accurate results of biological experiments have therefore become the new focus of gene expression data analysis. Among them, support vector machines (SVMs), which are based on statistical learning theory, are one of the youngest supervised algorithms. They have many features that make them attractive for gene expression analysis, including flexibility in choosing a similarity function, sparseness of the solution when treating large data sets, the ability to handle large feature spaces, and the ability to identify outliers. But because SVMs are a new machine learning technology, few chip researchers know them, and there are few papers on how to analyze gene expression data with SVMs. When chip experimenters and biologists obtain their cherished first-hand material, they may miss important information because of deficiencies in the analysis algorithms.

This paper describes the SVM algorithm in light of the current state of gene expression data analysis and gives a specific SVM algorithm and training process for gene expression data. In the paper, we mainly describe how to establish a complete SVM algorithm for gene expression data based on a well-known gene expression database, MYGD, provided by MIPS.
In addition, we improved the SVM algorithm in two respects, training speed and prediction accuracy, applied the improved algorithms to gene expression data analysis, and obtained satisfying results. All experimental results and a discussion of related problems are given at the end of the paper, together with the problems we have not resolved and the work we plan to do next.

Despite its special features, gene expression data can be analyzed with a general statistical analysis process. The paper therefore starts with data cleaning, including missing value filling and data normalization. In this part, we compared three methods of filling missing gene expression values and several normalization methods. Based on the cleaned dataset, we introduced different SVM kernel functions and several feasible SVM software processes for gene expression data classification. We also introduced two improved SVM algorithms, SISVM and SVM-KNN, for gene expression data.

Through the experiments, we obtained the following results. First, KNN filling and class-mean filling are better than other filling methods. The difference between these two is not statistically significant, so researchers should select either of them based on the aim of the study. Second, compared with other kernel functions, the RBF SVM and the higher-degree polynomial SVM are better at recognizing genes of the same functional class from gene expression data. Third, our SVM process is very easy to use, and we provide some programs to help researchers carry it out. We compared it with a commonly used SVM algorithm on the same dataset, and the results showed that the two have the same prediction accuracy and training speed. Fourth, the SVM-KNN algorithm can increase the accuracy of the model to some degree, and SISVM can raise training speed without losing prediction accuracy in gene expression data analysis.

All in all, as a new tool for treating microarray data, SVMs rest on sound theory and have bright prospects.
SVMs themselves and their improved algorithms will show advantages in much wider fields of gene research.
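As a rough illustration of two steps named in the abstract, the sketch below shows class-mean filling of missing values and the standard RBF and polynomial kernel formulas. It is not the thesis's implementation: the function names (`fill_with_class_mean`, `rbf_kernel`, `poly_kernel`), the parameter defaults (`gamma`, `degree`, `coef0`), the use of `None` to mark a missing value, and the toy matrix are all assumptions made for this example. Rows are taken to be genes and columns to be experiments.

```python
import math


def fill_with_class_mean(rows, labels):
    """Fill None entries with the mean of the same column within the same class.

    rows   -- list of expression profiles (one list per gene), None = missing
    labels -- functional class label of each gene (hypothetical toy labels)
    """
    filled = [list(r) for r in rows]
    n_cols = len(rows[0])
    for cls in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        for j in range(n_cols):
            observed = [rows[i][j] for i in members if rows[i][j] is not None]
            if not observed:
                continue  # nothing observed for this class and column; leave the gap
            mean = sum(observed) / len(observed)
            for i in members:
                if filled[i][j] is None:
                    filled[i][j] = mean
    return filled


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def poly_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Toy expression matrix (invented values): 4 genes x 3 experiments.
expr = [
    [1.0, None, 3.0],
    [3.0, 4.0, None],
    [10.0, 20.0, 30.0],
    [None, 22.0, 34.0],
]
classes = ["A", "A", "B", "B"]

filled = fill_with_class_mean(expr, classes)
similarity_rbf = rbf_kernel(filled[0], filled[1], gamma=0.1)
similarity_poly = poly_kernel(filled[0], filled[1], degree=2)
```

Either kernel value can serve as the similarity function of an SVM. Which kernel and which parameter values suit a given dataset is an empirical question, as the comparison reported in the abstract suggests.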
Keywords/Search Tags:Gene Expression Data, Pattern Classification, KNN method, Support Vector Machines, SISVM algorithm