
SRBCT Subtype Recognition Based On Gene Expression Profiles

Posted on: 2006-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Zhu
GTID: 2144360155460911
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Molecular-level analysis of tumors based on gene expression profiles is an important field of bioinformatics and has recently received a great deal of attention with the advent of DNA microarrays. In this work, from the viewpoint of information and system science, we explore the problems of tumor subtype discrimination and feature selection using methods from artificial intelligence and computing technology. Small round blue cell tumors (SRBCT) were taken as the case study, and the results are as follows. A new method for SRBCT feature gene selection is proposed. By modifying the "signal-to-noise ratio" measure, we derived a new measure, the weighted Bhattacharyya distance, as a criterion for screening predictive genes for SRBCT subtype classification. This criterion selected 152 genes, which formed the feature set whose subsets were used for classification. Three predictive models, following two strategies for tackling the multicategory problem, were constructed on the prescreened genes: a BP (back-propagation) network model, a Multiple Models Prediction Model based on several SVMs (MMPM), and Multicategory Support Vector Machines (MSVM). A comparison of the three methods suggested that the MSVM model was superior to the others: it achieved 100% accuracy using the top 25 genes ranked by the weighted Bhattacharyya distance. Because some genes are strongly correlated, a pairwise analysis of the feature genes based on Pearson correlation coefficients was carried out and used to remove redundant genes from the 25-gene subset. As a result, 15 genes remained and were taken as the final subset. The MSVM trained on these 15 genes achieved 100% accuracy on both the training and blind test datasets. An SOM (self-organizing map) network was used to investigate the clustering capability of the 15 features.
The clustering model grouped all SRBCT samples, without error, into four clusters corresponding to the four subtypes. Compared with the method of Khan et al., which required 96 genes, we achieved equal accuracy with a smaller subset of 15 feature genes, which demonstrates the efficiency and feasibility of the methods and predictive models proposed in this work. This work was supported by the National Natural Science Foundation of China.
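The two selection steps described in the abstract (ranking genes by a class-separability score, then dropping genes that are strongly correlated with a better-ranked gene) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis. The abstract does not define the weighting in the weighted Bhattacharyya distance, so the sketch uses the standard unweighted Bhattacharyya distance between two univariate Gaussians. The one-vs-rest aggregation by maximum, the greedy pruning order, the 0.9 correlation threshold, and all function and parameter names are hypothetical choices made only for illustration.

```python
import math
from statistics import fmean, pvariance


def bhattacharyya_distance(a, b, eps=1e-12):
    """Standard (unweighted) Bhattacharyya distance between two 1-D samples,
    each modelled as a Gaussian. The thesis's weighting is NOT reproduced."""
    mu_a, mu_b = fmean(a), fmean(b)
    # eps keeps the logarithm finite when a sample has zero variance.
    var_a, var_b = pvariance(a) + eps, pvariance(b) + eps
    mean_term = 0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
    var_term = 0.5 * math.log((var_a + var_b) / (2.0 * math.sqrt(var_a * var_b)))
    return mean_term + var_term


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu_u, mu_v = fmean(u), fmean(v)
    cov = sum((x - mu_u) * (y - mu_v) for x, y in zip(u, v))
    norm = math.sqrt(sum((x - mu_u) ** 2 for x in u)
                     * sum((y - mu_v) ** 2 for y in v))
    return cov / norm if norm > 0 else 0.0


def rank_genes(genes, labels):
    """Rank genes by their best one-vs-rest distance (assumed aggregation).

    genes  -- dict mapping gene name to a list of expression values
    labels -- list of class labels, one per sample
    """
    classes = sorted(set(labels))
    scores = {}
    for name, values in genes.items():
        per_class = []
        for c in classes:
            inside = [v for v, lab in zip(values, labels) if lab == c]
            outside = [v for v, lab in zip(values, labels) if lab != c]
            per_class.append(bhattacharyya_distance(inside, outside))
        scores[name] = max(per_class)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


def prune_correlated(genes, ranked, threshold=0.9):
    """Greedy redundancy filter: walk the ranking from best to worst and keep
    a gene only if |r| with every gene kept so far is below the threshold."""
    kept = []
    for name in ranked:
        if all(abs(pearson(genes[name], genes[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

In this sketch, a call such as `prune_correlated(genes, ranked[:25])` would correspond to reducing a 25-gene candidate list. How many genes survive depends on the data and on the threshold, so the sketch makes no claim to reproduce the 15-gene subset reported above.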
Keywords/Search Tags: Gene Expression Profiles, Tumor Subtype Classification, Feature Selection, SRBCT, Multicategory Support Vector Machines