
Study of Variable Selection Methods for DNA Microarray Data

Posted on: 2012-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: Q Wang
Full Text: PDF
GTID: 2190330335490510
Subject: Analytical Chemistry
Abstract/Summary:
Gene expression data obtained with microarray technology are now widely used to diagnose different kinds of cancer. Because thousands of gene expression levels are recorded simultaneously in a single experiment, bioinformatics methods such as clustering and classification are applied to understand and interpret the data. Although microarray-based disease diagnosis is promising because it is convenient and effective, the datasets themselves pose a challenge, usually called the "large p, small n" problem: the number of gene expression variables exceeds the number of available tissue samples by orders of magnitude. Variable selection therefore needs to be applied before an accurate model can be established. In this work, we propose a new variable selection method and compare it with other variable selection methods.

1. Based on model population analysis (MPA) and uninformative variable elimination (UVE), we propose a new variable selection method, Noise Incorporated Subwindow Permutation Analysis (NISPA), coupled with support vector machines (SVM). NISPA overcomes the problem of model uncertainty and uses added noise variables, rather than a user-defined threshold, to set the cutoff level. The essence of NISPA is that the variable importance distributions of the added noise variables serve as references for assessing the distributions of the experimental variables, so that all variables can be divided into three categories: informative variables, uninformative variables (noise variables), and interfering variables. Compared with conventional variable selection methods, NISPA is the first to distinguish the interfering variables. Two microarray datasets, Colon and Estrogen, are used to evaluate the performance of NISPA. The results show that variable selection with NISPA can significantly reduce the prediction errors of SVM classifiers.
It is concluded that NISPA is a good alternative variable selection algorithm.

2. We study NISPA in depth from the following three aspects: (1) Comparing NISPA at Q=1 with common univariate variable selection methods, such as the Pearson correlation coefficient and the Spearman rank correlation coefficient, the results indicate that NISPA at Q=1 shows different levels of consistency with the univariate methods on different datasets, and that it is more efficient than the univariate methods at selecting informative variables. (2) Comparing NISPA at Q=1 with NISPA at the optimal Q, we find that the variable importance values computed under these two conditions differ greatly, and that NISPA at the optimal Q selects more effective variables, yielding a model with a much lower prediction error. This indicates that interactions between variables can greatly improve the final result. (3) Comparing NISPA with other multivariate methods, such as SFS-based variable selection and recursive feature elimination, leave-one-out cross-validation results show that NISPA is competitive with other multivariate methods and is a good alternative variable selection method.
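The noise-reference idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several parts are assumptions made only for illustration: a nearest-centroid classifier stands in for the SVM to keep the example dependency-free, Q is taken to be the number of variables sampled into each subwindow, importance is taken as the rise in test error after permuting one variable, and the cutoff is simply the range of mean importances observed for the added noise variables. The names `centroid_error`, `noise_referenced_selection`, and `make_toy_data` are hypothetical.

```python
import random
import statistics

def centroid_error(train, test, cols):
    """Error rate of a nearest-centroid classifier restricted to columns `cols`.
    (Stand-in for the SVM used in the thesis; labels must be 0 or 1.)"""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in train if lab == label]
        centroids[label] = [statistics.fmean(r[c] for r in rows) for c in cols]
    wrong = 0
    for x, lab in test:
        dist = {k: sum((x[c] - m) ** 2 for c, m in zip(cols, cen))
                for k, cen in centroids.items()}
        wrong += min(dist, key=dist.get) != lab
    return wrong / len(test)

def noise_referenced_selection(X, y, n_noise=20, q=5, n_iter=300, seed=0):
    """Return (mean importances, categories) for the real variables in X."""
    rng = random.Random(seed)
    n_real = len(X[0])
    # Append pure Gaussian noise columns that act as the reference (assumption).
    data = [list(row) + [rng.gauss(0, 1) for _ in range(n_noise)] for row in X]
    n_total = n_real + n_noise
    scores = [[] for _ in range(n_total)]
    by_class = {lab: [i for i, v in enumerate(y) if v == lab] for lab in (0, 1)}
    for _ in range(n_iter):
        cols = rng.sample(range(n_total), q)       # random variable subwindow
        train_idx, test_idx = [], []
        for idx in by_class.values():              # stratified random split
            shuffled = rng.sample(idx, len(idx))
            cut = max(1, min(len(idx) - 1, round(0.7 * len(idx))))
            train_idx += shuffled[:cut]
            test_idx += shuffled[cut:]
        train = [(data[i], y[i]) for i in train_idx]
        test = [(data[i], y[i]) for i in test_idx]
        base = centroid_error(train, test, cols)   # error before permutation
        for c in cols:
            values = [x[c] for x, _ in test]
            rng.shuffle(values)                    # permute one variable only
            permuted = []
            for (x, lab), v in zip(test, values):
                row = list(x)
                row[c] = v
                permuted.append((row, lab))
            scores[c].append(centroid_error(train, permuted, cols) - base)
    means = [statistics.fmean(s) if s else 0.0 for s in scores]
    lo, hi = min(means[n_real:]), max(means[n_real:])  # noise-derived cutoffs
    cats = ["informative" if m > hi else "interfering" if m < lo
            else "uninformative" for m in means[:n_real]]
    return means[:n_real], cats

def make_toy_data(n_per_class=20, n_vars=8, seed=1):
    """Synthetic data in which only variable 0 carries class information."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_vars)]
            row[0] += 6.0 * label
            X.append(row)
            y.append(label)
    return X, y

X, y = make_toy_data()
means, cats = noise_referenced_selection(X, y)
for j, (m, c) in enumerate(zip(means, cats)):
    print(f"variable {j}: mean importance {m:+.3f} -> {c}")
```

Setting `q=1` in this sketch scores each variable on its own, which loosely mirrors the Q=1 case discussed in aspect (1), while a larger `q` lets variables be scored in combination with others. Comparing full importance distributions against the noise distributions with a statistical test, rather than comparing means against the noise range, would be a stricter variant of the same idea.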
Keywords/Search Tags:Model Population Analysis, Uninformative Variable Elimination, Univariate variable selection, Multivariate variable selection, DNA microarray technology