Objective: (1) To explore the effect of the maximal information coefficient (MIC) combined with recursive feature elimination for feature selection, evaluated with multi-class ROC; (2) to evaluate, using multi-class ROC, the classification performance of support vector machine (SVM) and K-nearest neighbor (KNN) learning models.

Methods: First, keywords such as "colorectal cancer", "colorectal tumor", "colorectal adenoma", "colon cancer", and "rectal cancer" were used to search the GEO platform, and suitable data sets were selected according to inclusion and exclusion criteria. The maximal information coefficient of each gene was calculated, and genes with a coefficient of at least 0.4817 were retained as features. Next, recursive feature elimination with 5-fold random forest cross-validation was applied for further feature selection. SVM and KNN models were then trained, with a 7:3 ratio of training set to validation set. Finally, the area under the multi-class ROC curve (AUC) was calculated using the macro-average method. All data processing and analysis were performed in Python 3.7.

Results: A total of 5 data sets were collected: GSE10714, GSE37364, GSE41657, GSE50114, and GSE50115, among which GSE50114 and GSE50115 are different data sets from the same experiment. GSE10714 contains 7 cancer, 5 adenoma, and 3 normal samples; GSE37364 contains 27 cancer, 29 adenoma, and 38 normal samples; GSE41657 contains 25 cancer, 51 adenoma, and 12 normal samples; GSE50114 and GSE50115 together contain 9 cancer, 37 adenoma, and 9 normal samples. The five data sets share 9827 common genes. Filtering by maximal information coefficient selected 55 genes, and the subsequent cross-validated recursive feature elimination retained 51 genes: ACAT1, ADAMDEC1, ADH1C, AHCYL2, AJUBA, APPL2, C1QC, C5orf30, CA2, CASP7, CDH3, CHGA, CHP2, CLDN1, COL1A1, CXCL3, DHRS11, FBLIM1, GDF15, GLA, GLTP, GNA11, GNA13, GTF2IRD1, HPGD, HSD11B2, ISX, MAOA, MMP7, MPEG1, NEBL, NFE2L3, NR3C2, PHF19, PHLDA1, PPAP2A, PXMP2, RNF43, S100A2, SLC29A1, SLCO4A1, SMPDL3A, SORD, SPPL2A, STAP2, STX12, SULT1A1, TNS4, TPD52L2, TSPAN7, and UGP2. With the selected feature genes, the linear SVM achieved a macro-average AUC of 0.9710 (0.9857, 0.9632, and 0.9412 for normal, adenoma, and cancer samples, respectively); with the original data without feature selection, the SVM achieved a macro-average AUC of 0.9627 (0.9823, 0.9389, and 0.9389, respectively). The difference between these two macro-average AUCs was statistically significant (P < 0.05). With the selected feature genes, the KNN model achieved a macro-average AUC of 0.9555 (0.9895, 0.9319, and 0.8998 for normal, adenoma, and cancer samples, respectively); with the original data without feature selection, it achieved a macro-average AUC of 0.9496 (0.9895, 0.9191, and 0.8773, respectively). This difference was also statistically significant (P < 0.05). Between the SVM and the KNN model trained on the selected feature genes, the difference in macro-average AUC was not statistically significant (P > 0.05).

Conclusion: Multi-class ROC is well suited to evaluating multi-classification tasks; combining the maximal information coefficient with recursive feature elimination for feature selection can improve the performance of machine learning models; and SVM and KNN models achieve good classification results on multi-class data.
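The filter-then-wrapper pipeline described in Methods can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the MIC filter (which the paper computes per gene with a 0.4817 cut-off, e.g. via the minepy package) is replaced here by an absolute-correlation filter so the sketch needs only NumPy and scikit-learn, and all model parameters and the filter quantile are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the merged expression matrix (samples x genes),
# with 3 classes playing the role of normal / adenoma / cancer.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Step 1: univariate filter. The paper scores each gene with the maximal
# information coefficient and keeps genes with MIC >= 0.4817; absolute
# Pearson correlation and a median cut-off stand in for that here.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
X_filt = X[:, scores >= np.quantile(scores, 0.5)]

# Step 2: recursive feature elimination with 5-fold cross-validation,
# using a random forest as the ranking estimator.
rfecv = RFECV(RandomForestClassifier(n_estimators=100, random_state=0), cv=5)
X_sel = rfecv.fit_transform(X_filt, y)

# Step 3: 7:3 split into training and validation sets.
X_tr, X_va, y_tr, y_va = train_test_split(X_sel, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Step 4: train a linear SVM and a KNN model, then compute the
# macro-averaged multi-class (one-vs-rest) ROC AUC for each.
for clf in (SVC(kernel="linear", probability=True, random_state=0),
            KNeighborsClassifier()):
    proba = clf.fit(X_tr, y_tr).predict_proba(X_va)
    auc = roc_auc_score(y_va, proba, multi_class="ovr", average="macro")
    print(type(clf).__name__, round(auc, 4))
```

The per-class areas reported in Results correspond to the individual one-vs-rest curves that the macro average is taken over; `roc_auc_score` with `average=None` would return them separately.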