
Analysis Of Cancer Gene Data Based On Random Forest And Support Vector Machine

Posted on: 2018-03-28 | Degree: Master | Type: Thesis
Country: China | Candidate: L F Liang
GTID: 2334330542454027 | Subject: Applied statistics
Abstract/Summary:
The cancer incidence rate in China has been rising. Chen et al. (2016), in the article "Cancer Statistics in China, 2015", estimated that China had 4.292 million new cancer cases in 2015, equivalent to a daily average of about 12 thousand new cases. Because cancer has a high mortality rate, early diagnosis is very important for treatment. With the continuous development of information technology and medical technology, and thanks in particular to progress in gene microarray technology, new progress has been made in cancer research.

At present, there are two main diagnostic methods for cancer. The first is clinical diagnosis; the second is DNA testing, that is, gene microarray technology. The main principle of the technique is to label the complementary nucleic acid sequence of a target gene and then hybridize the labeled sequence with the target gene. The results of hybridization are then observed by autoradiography or biochemical detection methods. Finally, the difference in the expression of the same gene across different tissues or cells is used for cancer diagnosis. The disadvantage of clinical diagnosis is that the early clinical symptoms of most cancer patients are not obvious, so many cancers are missed at the early stage, which delays treatment. DNA testing works at the gene level and can screen for cancer well in its early stages. However, because of the large number of human genes, it is not easy to determine the function of each gene.

This paper combines the support vector machine, the random forest, and the backward elimination method from multivariate statistics to analyze colon cancer gene data, in the hope of finding a small set of cancer-causing genes that can be used to screen for cancer. Random forests can be used to calculate the importance of each feature variable for classification, so we use the random forest method to select feature variables. However, because a random forest randomly selects samples and features when building its decision trees, the calculated feature importance is affected by noisy data, and a more important feature may appear less important. To reduce the adverse effect of noise on the results, we combine the random forest with backward elimination: we repeatedly build a random forest, remove a given percentage of the least important variables, and loop until the number of remaining features reaches the target. The choice of this percentage should also take into account the number of features. After the feature variables are selected by the random forest, the support vector machine is used for classification. Combining the random forest, backward elimination, and the support vector machine gives full play to the advantages of each approach at a different stage of processing.

From the empirical analysis of the colon cancer gene data, the main conclusions are as follows:
1. In the feature selection stage, this paper applies three methods to the 2000 genes: the t-test, the simple random forest, and the random forest with backward elimination. Comparing the top 20 genes selected by the random forest with backward elimination and by the t-test, only 8 genes are common to both lists, so the results of the two feature selection methods differ considerably.
2. In the classification stage, this paper uses the support vector machine to classify the test samples after feature selection by the t-test, the simple random forest, and the random forest with backward elimination. The results show that classification after simple random forest selection is better than after t-test selection, so the random forest method is superior to the t-test for feature selection.
3. When the random forest with backward elimination is used to select the top 19 genes, the classification accuracy on the samples is 100%, whereas the highest accuracy obtained with the simple random forest and the t-test is only 90%. The feature selection result of the random forest with backward elimination is therefore better than that of the t-test and the simple random forest. The random forest with backward elimination is an improvement on the random forest method that can reduce the feature set while improving classification accuracy.
4. Combining the random forest with backward elimination and the support vector machine achieves very good results in cancer data analysis. We achieved the goal of identifying cancer genes from a large amount of genetic data and thereby discriminating cancer.

The innovations of this paper are as follows:
1. This paper combines the random forest, the backward elimination method from multivariate statistics, and the support vector machine to analyze data. It gives full play to the advantages of the random forest in feature selection and of the support vector machine in dealing with low-dimensional, nonlinearly separable problems.
2. On the basis of the simple random forest, we add backward elimination: the random forest is built repeatedly, and a percentage of the least important variables is eliminated each time until the number of features is reduced to the target number. To some extent, this reduces the adverse effects that the randomness of a simple random forest and large amounts of noise have on the feature selection result.
3. In the feature selection process of the random forest with backward elimination, we delete a different percentage of features depending on the number of features remaining, which further improves the performance of the method.
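
The backward-elimination loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the names `backward_eliminate`, `drop_fraction`, and `score_importance` are hypothetical, and the thresholds and percentages in `drop_fraction` are invented for the example, since the abstract does not state the values actually used. The importance scorer is passed in as a function so that the loop itself stays independent of any particular random forest library.

```python
import math
from typing import Callable, Dict, List, Sequence


def drop_fraction(n_features: int) -> float:
    """Hypothetical schedule: fraction of features to drop per round.

    The abstract only says the percentage depends on the number of
    features; these thresholds and values are assumptions.
    """
    if n_features > 500:
        return 0.30
    if n_features > 100:
        return 0.20
    return 0.10


def backward_eliminate(
    features: Sequence[str],
    score_importance: Callable[[List[str]], Dict[str, float]],
    n_target: int,
    fraction_for: Callable[[int], float] = drop_fraction,
) -> List[str]:
    """Re-score the surviving features each round and drop the least important."""
    if n_target < 1:
        raise ValueError("n_target must be at least 1")
    remaining = list(features)
    while len(remaining) > n_target:
        # In practice this call would refit a random forest on the
        # surviving features and return its importance scores.
        importance = score_importance(remaining)
        ranked = sorted(remaining, key=lambda f: importance[f], reverse=True)
        # Drop at least one feature per round, but never go below the target.
        n_drop = max(1, math.floor(len(ranked) * fraction_for(len(ranked))))
        n_drop = min(n_drop, len(ranked) - n_target)
        remaining = ranked[: len(ranked) - n_drop]
    return remaining
```

In a setting like the one the abstract describes, `score_importance` would wrap a random forest fitted on the expression values of the surviving genes, and the returned gene list (for example, 19 genes out of 2000) would then be used as input to a support vector machine classifier.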
Keywords/Search Tags: Random Forest, Support Vector Machine, Feature Selection