Multivariate Data Analysis Methods Based On Feature Selection And Their Applications In Spectroscopic Study

Posted on: 2012-10-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M J Zhang
GTID: 1480303353976409
Subject: Analytical Chemistry

Abstract/Summary:
Feature selection is one of the most important aspects of multivariate data analysis. It eliminates redundant and irrelevant information and reduces data dimensionality, which simplifies computation. It can also improve the generalization performance and interpretability of models. Feature selection therefore plays an important role in data analysis.

This dissertation studied feature selection methods for high-dimensional data, taking proteomic mass spectrometric (MS) data and near-infrared spectroscopic (NIRS) data as the research objects. For proteomic MS data, the main aims were potential biomarker discovery and sample classification. For NIR data, the aim was wavelength selection to eliminate the effect of collinearity and to build effective models.

The main work of this dissertation is as follows:

(1) A feature selection method called ULDA-HFS (uncorrelated linear discriminant analysis based heuristic feature selection) was proposed. It consists of three main steps: (a) dimensionality reduction and data normalization; (b) data binning and discriminant bin selection; (c) ULDA for feature selection and sample classification. An ovarian cancer serum SELDI-TOF (surface-enhanced laser desorption/ionization time-of-flight) MS dataset was analyzed with the proposed method, yielding several potential biomarkers that discriminate ovarian cancer samples from healthy samples. The classification model built from these potential biomarkers achieved 100% specificity and 100% sensitivity.

(2) A strategy based on independent component analysis (ICA) and ULDA was proposed for proteomic profile analysis and potential biomarker discovery from proteomic mass spectra of cancer and control samples. The method consists of three main steps: (a) ICA decomposition of the mass spectra; (b) selection of discriminatory independent components (ICs) using a nonparametric test; and (c) selection of specific peaks (m/z locations) as potential biomarkers and construction of classification models by ULDA. A colorectal cancer dataset and an ovarian cancer dataset were analyzed with the proposed method. Classification yielded specificities of 100% and 96.77% on the colorectal and ovarian cancer datasets, respectively, and 100% sensitivity on both datasets.

(3) A feature selection method based on the F-score and partial least squares discriminant analysis (PLS-DA) was presented. After preprocessing, the peaks in the signals were picked and the variables were sorted according to their F-scores; potential biomarkers were then selected by performing PLS-DA with a forward selection strategy. Classification with the selected potential biomarkers yielded 100% specificity and 95.24% sensitivity on a colorectal cancer dataset, and 96.77% specificity and 100% sensitivity on an ovarian cancer dataset.

(4) A feature selection method named Monte Carlo Sampling-based Recursive Partial Least Squares (MCS-RPLS) was proposed. It first creates a number of sub-datasets by Monte Carlo sampling, then repeatedly builds PLS models on each sub-dataset and selects a feature subset for each, using the regression coefficients as the criterion. Finally, it determines the optimum feature set through statistical analysis of the feature subsets. The method was applied to several NIR datasets and compared with several other methods; the results showed that it can effectively select useful features from NIR data for multivariate calibration.

(5) A feature selection method based on the purity of spectral variables was proposed and used for wavelength selection from NIR datasets for quantitative modeling. After the purity of each spectral variable (i.e., wavelength) is calculated, the variables are sorted in descending order of purity and the optimum variables are selected step by step, with the contribution of each variable to the calibration model tested by PLS cross-validation. The method was applied to several NIR datasets, and the results indicated that it is simple and practical.
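
The F-score ranking and forward selection described in item (3) can be illustrated with a minimal sketch. It assumes the commonly used two-class F-score (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances); the abstract does not state which definition the dissertation uses. To keep the example dependency-free, a leave-one-out nearest-centroid classifier stands in for PLS-DA, so this is not the author's implementation. The function names and the toy data are hypothetical.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """Two-class F-score of one variable (assumed definition, see lead-in)."""
    m_all = mean(pos + neg)
    between = (mean(pos) - m_all) ** 2 + (mean(neg) - m_all) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    return between / within if within > 0 else float("inf")

def rank_by_f_score(X, y):
    """Return variable indices sorted by decreasing F-score; y holds 1 or 0."""
    n_vars = len(X[0])
    scores = []
    for j in range(n_vars):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    This classifier is a stand-in for PLS-DA, chosen only for brevity.
    """
    correct = 0
    for i in range(len(X)):
        dist = {}
        for label in (0, 1):
            # Class centroid computed without the held-out sample i.
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroid = [mean(r[j] for r in rows) for j in features]
            dist[label] = sum((X[i][j] - c) ** 2 for j, c in zip(features, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def forward_select(X, y):
    """Try variables in F-score order; keep one only if accuracy improves."""
    selected, best = [], 0.0
    for j in rank_by_f_score(X, y):
        acc = loo_accuracy(X, y, selected + [j])
        if acc > best:
            selected, best = selected + [j], acc
    return selected, best

# Made-up toy data: 6 samples x 3 variables; only variable 0 separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.4],
    [3.0, 5.5, 0.8],
    [3.2, 4.5, 0.3],
    [2.9, 5.0, 0.6],
]
y = [0, 0, 0, 1, 1, 1]

ranking = rank_by_f_score(X, y)
selected, accuracy = forward_select(X, y)
print(ranking, selected, accuracy)
```

On this toy data the discriminating variable is ranked first and is the only one retained, because adding the remaining variables does not raise the leave-one-out accuracy. With real spectra, the stand-in classifier would be replaced by a PLS-DA model evaluated under the validation scheme of choice.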
Keywords/Search Tags:Feature selection, Proteomics, Biomarker, Near-infrared spectroscopy, Multivariate data analysis