A Novel Method Of Nonlinear Rapid Feature Selection For High Dimension Features And Its Application On Bioinformatics

Posted on: 2012-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Dai
Full Text: PDF
GTID: 2210330338951731
Subject: Bioinformatics
Abstract/Summary:
Feature selection is one of the most active topics in data mining and pattern recognition. To enhance the generalization ability of models built on high-dimensional data, irrelevant and redundant features must be removed. Selecting p (p ≤ m) features from m features has 2^m possible subsets in theory; this is known as a non-deterministic polynomial (NP) complete problem, and the subsets cannot be enumerated exhaustively when m is very large. In this dissertation, to overcome the tendency of existing non-exhaustive feature selection methods to become trapped in local optima, the advantages of the support vector machine are exploited to develop a new rapid nonlinear feature selection method for high-dimensional feature vectors. For classification problems, since existing feature selection algorithms lack the ability to assess feature importance generally and reliably, we developed a novel feature significance test, named the paired sample t test based on pseudo support vector regression.

The functions of peptides and proteins are determined in essence by their primary sequences. Determining protein tertiary structure experimentally is very time-consuming, whereas the primary structures of peptides and proteins are easily determined. The quantitative sequence-activity relationship (QSAR) study of peptides and proteins is therefore extremely important, with broad prospects in the development of peptide drugs and in revealing the relationship between protein structure and function. One important aspect of QSAR modeling of proteins and peptides is the characterization of the primary structure. In this work, the peptide structure is represented directly by 531 physical and chemical properties of amino acids, and the new approach is applied to QSAR modeling (a regression problem) of two peptide systems: bitter-tasting dipeptides and angiotensin-converting enzyme inhibitors. For each system, 10 descriptors with clear meaning were retained. We built support vector regression (SVR) models on the retained descriptors; the accuracies of fitting, leave-one-out cross validation and external prediction are increased substantially compared with the results reported in the literature. Furthermore, to enhance the interpretability of the models, significance tests for the SVR models, single-factor relative importance analysis and single-factor effect analysis were carried out.

Gene expression profile data (a classification problem) for cancer and other complex diseases are characterized by small sample size, high dimensionality, high noise, high redundancy and nonlinearity. How to mine information from such data in depth is both a focus and a difficulty of bioinformatics. The new method was applied to two gene expression profiles, acute leukemia and colon cancer, and selected 6 and 4 candidate genes, respectively. We built support vector classification (SVC) models on the candidate genes; the accuracies of leave-one-out cross validation, total fold cross validation and independent test are equal or superior to the results reported in the literature. The paired sample t test based on pseudo SVR was carried out for the candidate genes on the SVC models, giving the order of relative importance of the candidate genes.

Protein-protein interactions play a key role in the function of cells and biological pathways. Understanding these interactions is important for clarifying the pathogenesis and treatment of various diseases. To further validate the effectiveness of the new feature selection method on large, complex, high-dimensional samples, we applied it to all sample data from the Human Protein Reference Database (HPRD), a database of human protein interactions. First, feature extraction on a protein interaction pair yields a 686-dimensional feature vector. Because support vector machine modeling has high time complexity when the training set is very large, we used another classifier, the relaxed variable kernel density estimator (RVKDE), for feature selection, and ultimately retained 232 features. We built an RVKDE model on the retained features; its accuracy on the independent test is slightly higher than the results reported in the literature.

In conclusion, the novel method has wide application prospects in regression (peptide QSAR modeling), in classification of small high-dimensional samples (tumor gene expression profiling), and in large, complex, high-dimensional samples.
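The abstract's argument about exhaustive search and local optima can be made concrete with a small sketch. The code below is not the thesis's SVM-based method, whose details the abstract does not give. It only contrasts exhaustive enumeration with plain greedy forward selection, a generic non-exhaustive strategy. The function names (`count_subsets`, `exhaustive_best`, `greedy_forward`) and the toy score table are hypothetical and chosen purely for illustration; in practice the score would be something like a cross-validated model accuracy.

```python
from itertools import combinations
from typing import Callable, Tuple

Subset = Tuple[int, ...]


def count_subsets(m: int) -> int:
    """Number of distinct subsets of m features, including the empty set."""
    return 2 ** m


def exhaustive_best(m: int, score: Callable[[Subset], float]) -> Subset:
    """Score every non-empty subset; only feasible for very small m."""
    all_subsets = (
        subset
        for p in range(1, m + 1)
        for subset in combinations(range(m), p)
    )
    return max(all_subsets, key=score)


def greedy_forward(
    m: int, score: Callable[[Subset], float], max_features: int
) -> Tuple[Subset, float]:
    """Add one feature at a time, stopping when the score no longer improves."""
    selected: list = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(m) if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max(
            (score(tuple(selected + [f])), f) for f in candidates
        )
        if top_score <= best_score:
            break  # no candidate improves the current subset
        selected.append(top_feature)
        best_score = top_score
    return tuple(selected), best_score


# Hypothetical score table (made up for illustration): feature 0 is the best
# single feature, but the best subset overall is {1, 2}.
TOY_SCORES = {
    frozenset({0}): 0.60,
    frozenset({1}): 0.50,
    frozenset({2}): 0.50,
    frozenset({0, 1}): 0.66,
    frozenset({0, 2}): 0.65,
    frozenset({1, 2}): 0.90,
    frozenset({0, 1, 2}): 0.60,
}


def toy_score(subset: Subset) -> float:
    """Look up the made-up score of a subset, ignoring feature order."""
    return TOY_SCORES[frozenset(subset)]


greedy_subset, greedy_score = greedy_forward(3, toy_score, max_features=3)
global_subset = exhaustive_best(3, toy_score)
```

On this toy table the greedy search commits to feature 0 first and therefore never reaches the subset {1, 2} that exhaustive search finds, which is the local-optimum behavior the abstract refers to. Exhaustive search is only possible here because m = 3; for the 686-dimensional feature vector mentioned above, `count_subsets(686)` is a number with more than 200 digits.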
Keywords/Search Tags:rapid nonlinear feature selection, high-dimensional features, quantitative sequence-activity relationship, gene expression profile, protein-protein interactions, support vector machine