Font Size: a A A

Research On Feature Analysis Methods In Microarray Gene Expression Data

Posted on:2016-09-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:A G WangFull Text:PDF
GTID:1220330488992534Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
The rapid development of microarray technology enables us to measure the expression of thousands of genes simultaneously. Consequently, it has been widely used to study the gene expression profiles of various cancers and tumors, and provides a powerful tool to investigate disease mechanisms, cancer diagnosis and prognosis at the molecular level. Due to the characteristics of microarray technology and the limitations of current technologies, microarray data are characterized by "high dimensionality, small sample size, and high noise rate", and its corresponding gene space usually consists of a large number of irrelevant and redundant genes, which poses a great challenge to microarray data analysis. How to identify discriminant genes and further investigate the relations between genes and a certain cancer plays a significant role in understanding disease mechanisms and improving clinical diagnosis accuracy. In microarray technology, gene is usually named as feature in accordance with data mining concepts. This dissertation is targeted at microarray gene expression data and conducts in-depth research on gene selection methods, and its main contributions include the following points.(1) Using a classifier as the cost function to evaluate the quality of candidate features, wrapper-based feature selection methods are generally time-consumpting. For the k-nearest-neighbor classifier, we first point out that such a problem mainly arises from a substantial number of repetitive calculations of distance between samples, and then propose a logical storage structure named classifier distance matrix to explicitly store, calculate and update the feature distance between any two samples in order to improve the time performance in feature selection. Extensive experimental results show that the proposed approach can obtain a subet of features of high quality as well as greatly reduce the actual time cost. Finally, theoretical time and space complexity analysis indidates its efficiency and feasibility in actual use.(2) Wrapper-based feature selection methods are often required to conduct a large number of wrapper evaluations and have a high time-complexity in feature selection. We first point out that one of the main reasons is that wrapper methods fail to explicitly identify redundant features and remove them from candidate features. On the basis of this, we propose to embed markov blanket technique into the wrapper-based feature selection procedure to identify and remove redundant features in order to reduce the number of required wrapper evaluations. We then analyze and verify the effectiveness and efficiency of the proposed approach. Experimental results indicate that the proposed approach can obtain a high-quality feature subset, and significantly reduce the number of required evaluations and time cost.(3) Since partial least squares based recursive feature elimination method eliminates the least important features in each iteration, it has a high time complexity when performing feature selection on high-dimensional data. To alleviate this problem, inspired by the temperature decay rate in metallurgy annealing, we propose two feature selection algorithms, PLS-RFE-SA and PLS-RFE-SQRT using dynamic feature elimination strategies. The main idea is to remove a great number of features in initial iterations, remove a smaller number of features as the iteration proceeds, and continue the process until all features are ranked. Experimental results show that the two proposed approaches can obtain feature subsets of high quality and achieve significant improvement in time performance. Experimental results regarding feature subset consistency indicates that PLS-RFE-SA has better consistency than PLS-RFE-SQRT. We finally demonstrate their superiority in time performance with theoretical analysis.(4) Considering that data of high-dimensionality and small smaple size can lead to over-fitting easily, we propose a hybrid two-phase feature selection algorithm called mRMR-HS, which takes advantages of both filter methods in lower time-complexity and wrapper methods in high classification accuracy. In the first stage, we utilize minimal redundancy maximal relevance algorithm (mRMR) to select a small number of discriminant features from original feature space. In the second stage, we use harmony search algorithm (HS) that has global heuristic search capacity to generate candidate feature subsets, and then evaluate these features in a wrapper way and return the optimal one. Experimental results show that in comparison with mRMR, mRMR-HS can further identify discriminant features and obtain better accuracy, and that compared with HS, mRMR-HS can obtain a more compact feature subset and has a faster convergence rate.
Keywords/Search Tags:Microarray Data, Feature Selection, Filter Method, Wrapper Method, Markov Blanket, Partial Least Squares, Harmony Search
PDF Full Text Request
Related items