A Study On Feature Selection For Cancer Diagnosis Based On SELDI Protein Mass Spectrometry Data

Posted on: 2012-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Wang
GTID: 2154330335962759
Subject: Detection Technology and Automation
Abstract/Summary:
Cancer is one of the most lethal diseases threatening human health. Researchers have found that protein levels in cancer patients change even at an early stage, before any symptoms appear. A protein whose level changes in this way is called a biomarker, and the discovery of more cancer biomarkers brings more hope of conquering cancer. In theory, if we can understand how proteins behave, we can understand the mechanisms by which diseases such as cancer arise and develop, and achieve early diagnosis and treatment. However, most proteins are redundant or irrelevant for cancer diagnosis; only the remaining few are biomarkers. How to select these biomarkers is an active topic in proteomics research.

From the standpoint of pattern recognition and machine learning, selecting protein biomarkers is a feature selection problem. Linear discriminant analysis (LDA) is one of the classic feature extraction algorithms in pattern recognition. However, SELDI-TOF-MS data usually suffer from the small sample size problem, in which the dimension of the sample space is much larger than the number of samples in the training set. This makes the within-class scatter matrix singular and causes the algorithm to fail. In addition, feature extraction, unlike feature selection, transforms the features into a new space so that the new features are more discriminative and better suited to classification, but the resulting features are difficult to interpret biologically.

To address these problems, this study considers the frequency-domain features of mass spectra, using the wavelet transform to extract the detail information in the data, reduce the feature dimension, and improve computational efficiency. To preserve the biological significance of the selected biomarkers, we use the null space LDA algorithm to solve the small sample size problem and select the biomarker features.
A recursive framework is applied to reduce the correlation among the selected features, so that the selected original protein biomarkers have both a high classification rate and biological significance.

Public ovarian and prostate cancer data sets, together with breast cancer data provided by Zhejiang Cancer Hospital, were used for the analysis and numerical experiments in this study. Based on the results, we also compare the proposed algorithm with other classical algorithms in terms of classification performance and feature correlation. The experimental results show that: 1) compared with the classical algorithms, the feature subset selected by the proposed algorithm not only achieves better classification performance but also greatly reduces the correlation between features; 2) the algorithm is able to pick out a small number of protein biomarkers that have high discriminative power and biological significance.
Keywords/Search Tags:Protein spectrum, Feature selection, Recursive Null Space LDA, Cancer classification, Protein biomarker
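
The null space LDA step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes NumPy, a two-or-more-class labelled matrix `X` (samples by features), and the common formulation in which a discriminant vector is sought inside the null space of the within-class scatter matrix. The function names `null_space_lda_direction` and `rank_features_by_weight` are hypothetical, and ranking the original features by the absolute weight of the discriminant vector is an assumption made here for illustration; the recursive framework of the thesis is not reproduced.

```python
import numpy as np


def null_space_lda_direction(X, y, tol=1e-10):
    """Return a unit vector w with w^T S_w w = 0 and maximal w^T S_b w.

    Illustrative sketch only. X has shape (n_samples, n_features) and
    y holds the class labels. It is meant for the small sample size case,
    where n_samples << n_features and S_w is singular.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    mean_all = X.mean(axis=0)

    # Rows of Hw are samples centred on their class mean: S_w = Hw^T Hw.
    Hw = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    # Rows of Hb are weighted class-mean offsets: S_b = Hb^T Hb.
    Hb = np.vstack([
        np.sqrt((y == c).sum()) * (X[y == c].mean(axis=0) - mean_all)
        for c in classes
    ])

    # Right singular vectors beyond the numerical rank of Hw span null(S_w).
    _, s, Vt = np.linalg.svd(Hw, full_matrices=True)
    rank = int((s > tol * s.max()).sum()) if s.size and s.max() > 0 else 0
    N = Vt[rank:].T  # orthonormal basis, shape (n_features, n_features - rank)
    if N.shape[1] == 0:
        raise ValueError("S_w is non-singular; null space LDA does not apply")

    # Inside the null space, pick the direction with the largest
    # between-class scatter (leading right singular vector of Hb @ N).
    _, _, Vb = np.linalg.svd(Hb @ N, full_matrices=False)
    w = N @ Vb[0]
    return w / np.linalg.norm(w)


def rank_features_by_weight(w):
    """Hypothetical helper: order original features by |w|, largest first."""
    return np.argsort(-np.abs(w))
```

Because `w` lies in the null space of the within-class scatter matrix, all training samples of one class project onto the same value, which is why the singularity that breaks classical LDA becomes usable here. Calling `rank_features_by_weight(null_space_lda_direction(X, y))` then gives one possible ordering of the original m/z features, under the ranking assumption stated above.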