Study On Informative Gene Selection And Classification Of Tumor

Posted on: 2016-11-13    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H Y Zhang    Full Text: PDF
GTID: 1314330512966473    Subject: Bioinformatics
Abstract/Summary:
Tumors arise from interactions between multiple genes and the environment. The emergence and rapid development of large-scale gene-expression technology provide an entirely new platform for tumor investigation. Data mining based on gene expression profiles plays an important role in discovering pathogenic genes, diagnosing tumors clinically, judging therapeutic effect, and elucidating mechanisms of pathogenesis. Tumor gene expression profile data have the following features: high dimensionality, small or relatively small sample size, large differences in sample backgrounds, presence of nonrandom noise (e.g., batch effects), high redundancy, nonlinearity, and pairwise gene interactions. Traditional statistical and pattern recognition methods are of limited use on such data. In this work, methods for informative gene selection and classifier construction are studied according to these characteristics of gene expression profile data. The main results are as follows:

(1) Binary Matrix Shuffling Filter (BMSF), a new feature selection method for high-dimensional data based on the support vector machine, is proposed. Most gene selection methods in the literature focus on screening individual genes or pairs of genes without considering possible interactions among multiple genes. By considering the interactions between multiple genes and introducing a random binary matrix, BMSF converts the classification problem into a regression problem. With the kernel function parameter optimized, high-dimensional feature selection is then carried out by support vector machine regression. During gene selection, the set of genes kept in the model is recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes to cancer classification.
BMSF was applied to 9 binary classification datasets, and the prediction accuracy was far better than that reported in the literature. The small number of informative genes selected from each dataset led to significantly improved leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. The selected genes also generalize broadly, since multiple commonly used classifiers achieved LOOCV accuracy equivalent to or much higher than that reported in the literature.

(2) Top Scoring Genes (TSG), a new method based on the chi-square test, is developed for high-dimensional feature selection and direct classification. Prediction accuracy depends not only on feature selection but also on the classifier, and training is the major cause of model overfitting. The TSP (top scoring pair) family performs both feature selection and classification. TSG overcomes the following problems of the TSP family: the family cannot reflect size differences among samples, the number of selected informative genes is always even, and its multiclass algorithm is complex. Direct classification with no need for training, based on transductive inference, is proposed and implemented for the first time. The classification process of TSG is as follows. Assume that a test sample belongs to the positive (+) class, and denote the chi-square value of the combined test and training samples as Chi+. Then assume that the test sample belongs to the negative (-) class, and denote the chi-square value of the combined test and training samples as Chi-. If Chi+ > Chi-, the test sample is assigned to the positive (+) class; otherwise, it is assigned to the negative (-) class. Multiclass classification is handled by analogy.
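The Chi+ versus Chi- decision rule can be illustrated with a minimal Python sketch. The abstract does not say how the chi-square value of a gene set is computed, so the scoring here is an assumption: each gene pair (i, j) is coded as "expression of i > expression of j" in the style of the TSP family, a 2x2 table of class against that comparison is built, and the Pearson chi-square statistics are summed over all pairs. The names chi_square_2x2, total_chi_square, and direct_classify are hypothetical.

```python
from itertools import combinations


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an entire row or column is empty
    return n * (a * d - b * c) ** 2 / denom


def total_chi_square(samples, labels, genes):
    """Sum of chi-square values over all pairs of the given genes.

    ASSUMPTION (not from the source): a pair (i, j) is coded by whether
    x[i] > x[j], and the table crosses that outcome with the class label.
    """
    total = 0.0
    for i, j in combinations(genes, 2):
        a = b = c = d = 0
        for x, y in zip(samples, labels):
            greater = x[i] > x[j]
            if y == "+":
                a, b = (a + 1, b) if greater else (a, b + 1)
            else:
                c, d = (c + 1, d) if greater else (c, d + 1)
        total += chi_square_2x2(a, b, c, d)
    return total


def direct_classify(train_x, train_y, test_x, genes):
    """Assign the label under which the combined data score higher."""
    # Chi+: test sample provisionally placed in the positive class.
    chi_pos = total_chi_square(train_x + [test_x], train_y + ["+"], genes)
    # Chi-: test sample provisionally placed in the negative class.
    chi_neg = total_chi_square(train_x + [test_x], train_y + ["-"], genes)
    return "+" if chi_pos > chi_neg else "-"


# Toy data: gene 0 exceeds gene 1 in "+" samples and the reverse in "-".
train_x = [[5, 1], [6, 2], [7, 1], [1, 5], [2, 6], [1, 7]]
train_y = ["+", "+", "+", "-", "-", "-"]
print(direct_classify(train_x, train_y, [8, 2], genes=[0, 1]))
print(direct_classify(train_x, train_y, [2, 8], genes=[0, 1]))
```

No model is fitted in this sketch: the test sample is scored only by recomputing the statistic on the combined data, which is the sense in which the classification is "direct".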
The feature selection process of TSG is as follows. It starts with the top two genes and then repeatedly adds to the candidate gene set the gene that has the best combined effect with the genes already selected. The algorithm automatically reports the total number of informative genes selected, using leave-one-out cross-validation. The algorithm was applied to 9 binary and 10 multiclass gene expression datasets involving human cancers. The TSG classifier outperforms other classifiers on most of the 19 datasets. In particular, the prediction accuracy on the training dataset is quite close to that on the independent test dataset. The accuracy on the test dataset is even higher than that on the training dataset, and the results show that TSG can effectively control overfitting through direct classification, which requires no training.

(3) A chi-square test-based integrated rank gene and direct classifier (χ²-IRG-DC) is developed for gene selection based on the chi-square test and gene interactions. The feature selection process of χ²-IRG-DC is as follows. First, the weighted integrated rank of gene importance is obtained from chi-square tests of single genes and pairwise gene interactions. Then, the ranked genes are introduced sequentially, and redundant genes are removed according to leave-one-out cross-validation accuracy and the gain in chi-square value within the training set, yielding the informative genes. Finally, the accuracy on independent test data is determined with χ²-DC, using the genes obtained above. χ²-IRG-DC inherits the advantages of TSG while greatly reducing the complexity of feature selection through the weighted integrated rank of gene importance, and it enhances the robustness of feature selection by introducing chi-square gain as a second criterion. The independent test accuracies on nine binary and ten multiclass tumor gene-expression datasets showed that χ²-IRG-DC is clearly superior to the methods reported in the literature.
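The TSG forward selection described above (start from the top two genes, then add genes greedily and report the gene count by leave-one-out cross-validation) can be sketched in Python as follows. Several details are assumptions, since the abstract does not give them: the "best combined effect" is taken to be the largest summed pairwise chi-square score, pairs are coded by expression order, and the reported gene set is the one with the highest leave-one-out accuracy up to a hypothetical max_genes limit. All function names are illustrative only.

```python
from itertools import combinations


def chi2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (0.0 if degenerate)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def score(xs, ys, genes):
    """ASSUMED set score: chi-square summed over all pairs of `genes`."""
    total = 0.0
    for i, j in combinations(genes, 2):
        t = [[0, 0], [0, 0]]  # rows: class (+, -); columns: x[i] > x[j] or not
        for x, y in zip(xs, ys):
            t[0 if y == "+" else 1][0 if x[i] > x[j] else 1] += 1
        total += chi2(t[0][0], t[0][1], t[1][0], t[1][1])
    return total


def classify(xs, ys, x_new, genes):
    """Direct classification: compare Chi+ and Chi- on the combined data."""
    pos = score(xs + [x_new], ys + ["+"], genes)
    neg = score(xs + [x_new], ys + ["-"], genes)
    return "+" if pos > neg else "-"


def loocv_accuracy(xs, ys, genes):
    """Leave-one-out accuracy of the direct classifier for a fixed gene set."""
    hits = 0
    for k in range(len(xs)):
        rest_x, rest_y = xs[:k] + xs[k + 1:], ys[:k] + ys[k + 1:]
        hits += classify(rest_x, rest_y, xs[k], genes) == ys[k]
    return hits / len(xs)


def forward_select(xs, ys, max_genes):
    """Greedy forward selection; returns (best gene set, its LOOCV accuracy)."""
    n_genes = len(xs[0])
    # Start from the highest-scoring pair of genes.
    selected = list(max(combinations(range(n_genes), 2),
                        key=lambda pair: score(xs, ys, pair)))
    best_genes, best_acc = list(selected), loocv_accuracy(xs, ys, selected)
    while len(selected) < min(max_genes, n_genes):
        rest = [g for g in range(n_genes) if g not in selected]
        # Add the gene whose inclusion gives the highest combined score.
        selected.append(max(rest, key=lambda g: score(xs, ys, selected + [g])))
        acc = loocv_accuracy(xs, ys, selected)
        if acc > best_acc:  # keep the smallest set reaching the best accuracy
            best_genes, best_acc = list(selected), acc
    return best_genes, best_acc


# Toy data: genes 0 and 1 flip order between classes; gene 2 is constant.
xs = [[5, 1, 0], [6, 2, 0], [7, 1, 0], [1, 5, 0], [2, 6, 0], [1, 7, 0]]
ys = ["+", "+", "+", "-", "-", "-"]
print(forward_select(xs, ys, max_genes=3))
```

The exhaustive search over all starting pairs is quadratic in the number of genes, so on real expression data a pre-filter would likely be needed. Cost of that kind may be what the weighted integrated rank in χ²-IRG-DC is meant to reduce, though the abstract does not give enough detail to confirm this.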
As a feature selection method, χ²-IRG-DC is better than mRMR, SVM-RFE, HC-K-TSP, and TSG. As a classifier, χ²-DC is better than NB and KNN, and its performance is similar to that of the SVM classifier.

The proposed methods have important theoretical and practical value for advancing feature selection in high-dimensional data and the classification of tumors.
Keywords/Search Tags: Tumor, Gene expression profile, High-dimensional feature selection, Support vector machine, Chi-square test, Direct classification