
A Study of a Tumor Classification Method Using a Naïve Bayes Classifier Based on the Maximum Relevance Minimum Redundancy Feature Selection Method

Posted on: 2018-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: J P Chen
Full Text: PDF
GTID: 2334330536972250
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: The study of tumor occurrence and development mechanisms using gene expression profiles aids the diagnosis and personalized treatment of tumor patients. However, the large number of genes tested, the high cost of testing, and the difficulty of sample collection result in the high-dimension, small-sample-size problem. In addition, gene expression data are characterized by heavy noise, high redundancy, and non-equilibrium distribution. Traditional classification methods are not applicable to such data and face unprecedented challenges. Combining feature selection with a classifier is a promising approach to this problem. Our study considered a naïve Bayes classifier based on the maximum relevance minimum redundancy feature selection method (mRMR-NBC), which was applied to simulated data, open gene expression profiles, and the gene expression profiles of clinical tumor samples. mRMR-NBC was systematically compared with other classification methods to demonstrate the advantage of this approach and to provide a theoretical basis for the classification of clinical tumor samples.

Methods: (1) A simulation study was conducted. mRMR-NBC was applied to the classification of high-dimensional data and compared with support vector machine, extreme learning machine, and random forests. The effects of sample size, number of genes, and signal-to-noise ratio on classification accuracy were also explored. (2) Open colon and lung gene expression profiles were used to compare the classification accuracy of mRMR-NBC, support vector machine, extreme learning machine, and random forests, in order to verify the results of the simulation study. (3) The GSE10245 gene expression dataset, related to non-small cell lung cancer, was downloaded from the Gene Expression Omnibus (GEO) database; it includes 40 lung adenocarcinoma and 18 lung squamous cell carcinoma tissues. After preprocessing, mRMR-NBC was used to select feature genes. Shortest path analysis with Dijkstra's algorithm was applied to select candidate genes. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were also performed. A literature review was used to analyze the role, in tumor occurrence and development, of the genes that contribute to the classification of the samples.

Results: (1) The overall classification accuracy of mRMR-NBC reached 96.71%, which was equal to that of support vector machine and higher than that of random forests and extreme learning machine. The correlation coefficients between the classification accuracies of these methods and sample size were statistically significant (P<0.05). The classification accuracy of mRMR-NBC, support vector machine, and extreme learning machine was negatively correlated with sample size, whereas the classification accuracy of random forests was positively correlated with sample size. In addition, the classification accuracy of random forests was negatively correlated with the number of genes. No correlation between the classification accuracy of mRMR-NBC and the number of genes was detected in this study. A factorial design ANOVA showed that sample size had an effect on the classification accuracy of mRMR-NBC (P<0.05). (2) With the top fifteen genes of the colon profile and the top twelve genes of the lung profile added to the model, classification accuracy reached 95.16% and 97.26%, respectively. mRMR-NBC reached high classification accuracy when only a few genes were added to the model, and accuracy remained stable as more genes were added. On the colon and lung gene expression profiles, the classification accuracy of support vector machine reached 90.32% and 94.52%, that of extreme learning machine 82.26% and 69.86%, and that of random forests 81.98% and 77.62%. (3) Eight genes and twenty-one genes were selected using mRMR-NBC and shortest path analysis, respectively. AURKA and SLC7A2 were selected three times and twice, respectively, in the shortest path analysis. These genes participate in many important pathways, such as oocyte meiosis, cell cycle, and pathways in cancer.

Conclusion: mRMR-NBC is applicable to the analysis of high-dimensional, small-sample-size data. It reaches high classification accuracy when only a few features are added to the model, and it outperforms random forests and extreme learning machine. mRMR-NBC can accurately select tumor-related genes, which helps to explore the role of genes in tumor occurrence and development and promotes the development of precision and personalized medicine.
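The two-stage idea behind mRMR-NBC (rank features by relevance to the class label while penalizing redundancy with features already chosen, then fit a naïve Bayes classifier on the selected features) can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes the absolute Pearson correlation as a simple stand-in for the mutual information usually used in mRMR, a Gaussian naïve Bayes classifier, and synthetic toy data. All function names (`pearson`, `mrmr_select`, `fit_gnb`, `predict_gnb`) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(X, y, k):
    """Greedy mRMR: choose k feature indices by relevance minus mean redundancy.

    Assumption: |Pearson correlation| replaces mutual information.
    """
    cols = [list(c) for c in zip(*X)]
    relevance = [abs(pearson(c, y)) for c in cols]
    selected = [max(range(len(cols)), key=lambda j: relevance[j])]
    while len(selected) < k:
        def score(j):
            redundancy = sum(abs(pearson(cols[j], cols[s])) for s in selected)
            return relevance[j] - redundancy / len(selected)
        candidates = [j for j in range(len(cols)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected


def fit_gnb(X, y, eps=1e-9):
    """Fit Gaussian naive Bayes: per class, a log prior and per-feature (mean, variance)."""
    model = {}
    for label in set(y):
        rows = [r for r, t in zip(X, y) if t == label]
        stats = []
        for c in zip(*rows):
            mean = sum(c) / len(c)
            var = sum((v - mean) ** 2 for v in c) / len(c) + eps
            stats.append((mean, var))
        model[label] = (math.log(len(rows) / len(X)), stats)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        log_prior, stats = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(row, stats)
        )
    return max(model, key=log_posterior)


# Synthetic toy data (an assumption, not the thesis's data): 80 samples, 20 features.
random.seed(0)
labels = [i % 2 for i in range(80)]
data = []
for label in labels:
    f0 = 3.0 * label + random.gauss(0, 1)    # informative feature
    f1 = f0 + random.gauss(0, 0.05)          # near-duplicate of f0 (redundant)
    f2 = -3.0 * label + random.gauss(0, 1)   # informative, with independent noise
    data.append([f0, f1, f2] + [random.gauss(0, 1) for _ in range(17)])

X_train, y_train = data[:60], labels[:60]
X_test, y_test = data[60:], labels[60:]

# Select features on the training split only, then classify the held-out split.
selected = mrmr_select(X_train, y_train, k=2)
project = lambda rows: [[r[j] for j in selected] for r in rows]
model = fit_gnb(project(X_train), y_train)
predictions = [predict_gnb(model, r) for r in project(X_test)]
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print("selected features:", selected)
print("held-out accuracy:", accuracy)
```

In this toy setup the redundancy penalty is what discourages picking both `f0` and its near-copy `f1`. Selecting features on the training split only avoids leaking held-out information into the feature ranking.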
Keywords/Search Tags:Gene expression profiles, Tumor classification, Feature selection, Machine learning