
Biomarkers Selection And Classification For Cancer Based On Protein Mass Spectrometry Data

Posted on: 2015-06-16  Degree: Master  Type: Thesis
Country: China  Candidate: K Wang  Full Text: PDF
GTID: 2284330482466926  Subject: Bioinformatics
Abstract/Summary:
Proteomic mass spectrometry is one of the most widely used technologies in protein research, especially for early cancer diagnosis and biomarker identification. The large volume of mass spectrum data requires further analysis to support qualitative and quantitative protein studies, so bioinformatics methods for analyzing these data have become essential. By comparing protein extracts obtained from the cell tissue of a case group and a control group, we can find abnormal biomarkers that play an important role in disease pathology and thereby classify the two groups correctly.

Tumor protein mass spectrometry data are characterized by small sample sizes, high dimensionality, heavy noise, and nonlinearity. Mining credible protein biomarkers is important for early tumor diagnosis and for revealing pathogenesis. First, we consider the differences in the intensity values of the features (kurtosis values) between the case group and the control group along both the longitudinal and transverse directions. Specifically, the data set can be regarded as a mixed-level two-factor experiment, in which factor A is the sample label (two levels: case group and control group) and factor B is the kurtosis values (m levels, with each feature representing one level). By applying unbalanced two-way analysis of variance to the tumor protein mass spectrometry data, we developed a new high-dimensional feature selection method, Top Score Feature Subset based on F test (TSFS-F), and a new classification method, Direct Inference Classifier based on F test (DIC-F). To evaluate the effectiveness of our methods, we used two reference feature selection methods, SVM-MRMR and SVM-SVMRFE.
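The two-factor layout described above (factor A = sample label, factor B = feature) can be illustrated with a small sketch. This is not the TSFS-F procedure, whose details the abstract does not give. It only computes the classical two-way ANOVA F statistics for a layout in which every feature has the same number of case samples and the same number of control samples; in that special case the cell counts are proportional, so the textbook sum-of-squares decomposition still applies even though the two groups differ in size. The function name `two_way_anova_f` and the input format are assumptions made for illustration, and the sketch treats all intensity values as independent observations.

```python
from statistics import fmean


def two_way_anova_f(case, control):
    """Two-way ANOVA F statistics for a (group x feature) layout.

    case, control: lists of samples; each sample is a list of m intensities.
    Assumes every sample has all m features, so the cell counts are
    proportional (n_case per feature in one row, n_control in the other).
    """
    groups = [case, control]
    m = len(case[0])
    # cells[i][j] holds every observation for group i and feature j.
    cells = [[[sample[j] for sample in g] for j in range(m)] for g in groups]
    all_vals = [v for row in cells for cell in row for v in cell]
    grand = fmean(all_vals)

    # Factor A (sample label): weighted squared deviations of group means.
    ss_a = 0.0
    for row in cells:
        vals = [v for cell in row for v in cell]
        ss_a += len(vals) * (fmean(vals) - grand) ** 2

    # Factor B (feature): weighted squared deviations of feature means.
    ss_b = 0.0
    for j in range(m):
        vals = cells[0][j] + cells[1][j]
        ss_b += len(vals) * (fmean(vals) - grand) ** 2

    # Interaction = between-cell variation not explained by A or B.
    ss_cells = sum(
        len(c) * (fmean(c) - grand) ** 2 for row in cells for c in row
    )
    ss_ab = ss_cells - ss_a - ss_b

    # Error = variation of observations around their own cell mean.
    ss_e = sum((v - fmean(c)) ** 2 for row in cells for c in row for v in c)

    df_a, df_b = 1, m - 1
    df_ab = df_a * df_b
    df_e = len(all_vals) - 2 * m  # observations minus number of cells
    ms_e = ss_e / df_e
    return {
        "A": ss_a / df_a / ms_e,
        "B": ss_b / df_b / ms_e,
        "AxB": ss_ab / df_ab / ms_e,
    }
```

A large interaction statistic in such a layout would indicate that the case/control difference varies across features, which is the kind of signal a feature selection method built on F tests could exploit.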
Because MRMR and SVMRFE yield only a ranking of features by significance and do not specify the size of the feature subset, we use an SVM to evaluate the ranked features by 10-fold cross-validation on the training set, and take the top-ranked features with the highest cross-validation prediction accuracy as the optimal feature subset.

Experimental results on two tumor protein mass spectrometry data sets, each with 10 repetitions, show that: 1) the optimal feature subset selected by TSFS-F is relatively small and stable in size; 2) coupled with the KNN, NB, and SVM classifiers, TSFS-F achieves better independent prediction accuracy than the reference feature selection methods and effectively reduces the overfitting of SVM, which demonstrates the robustness of our method; 3) the classification accuracy of DIC-F is superior to that of KNN and NB and slightly lower than that of SVM, but it is worth noting that the accuracy of DIC-F coupled with TSFS-F is superior to that of the other combinations of feature selection methods and classifiers. TSFS-F and DIC-F have broad application prospects in high-dimensional feature selection tasks such as biomarker selection and classification of complex diseases.
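The cross-validated choice of subset size used for the reference methods (take the top k ranked features, score them by cross-validation on the training set, keep the best k) can be sketched as follows. To stay within the Python standard library, a simple nearest-centroid classifier stands in for the SVM used in the thesis. The function names, the interleaved fold assignment, and the tie-breaking rule (prefer the smaller subset) are all assumptions for illustration, not details taken from the thesis.

```python
from statistics import fmean


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT an SVM): assign each test row to the class
    whose training centroid is closest in squared Euclidean distance."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [fmean(col) for col in zip(*rows)]
    return [
        min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        for x in test_x
    ]


def cv_accuracy(x, y, n_folds):
    """k-fold cross-validation accuracy with interleaved folds."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(x), n_folds))
        train_idx = [i for i in range(len(x)) if i not in test_idx]
        preds = nearest_centroid_predict(
            [x[i] for i in train_idx],
            [y[i] for i in train_idx],
            [x[i] for i in sorted(test_idx)],
        )
        correct += sum(p == y[i] for p, i in zip(preds, sorted(test_idx)))
    return correct / len(x)


def best_top_k(x, y, ranking, n_folds=10):
    """Return the top-k prefix of `ranking` with the highest CV accuracy.

    ranking: feature indices ordered from most to least important, as
    produced by a ranking method such as MRMR or SVM-RFE.
    Ties are resolved in favour of the smaller subset.
    """
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]
        x_k = [[row[j] for j in cols] for row in x]
        acc = cv_accuracy(x_k, y, n_folds)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc
```

Selecting the subset size on the training set only, as the abstract describes, keeps the independent test set untouched for the final accuracy estimate.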
Keywords/Search Tags:Tumor, Protein mass spectrometry, High-dimensional feature selection, Analysis of Variance, Classification