Objective: (1) To explore the effect of the maximal information coefficient (MIC) combined with recursive feature elimination for feature selection, evaluated with multi-class ROC; (2) to evaluate, using multi-class ROC, the classification performance of support vector machine (SVM) and K-nearest neighbor (KNN) learning models.

Methods: First, keywords such as "colorectal cancer", "colorectal tumor", "colorectal adenoma", "colon cancer", and "rectal cancer" were used to search the GEO platform, and suitable data sets were selected according to inclusion and exclusion criteria. The maximal information coefficient of each gene was calculated, and genes with a coefficient of at least 0.4817 were retained as features. Next, recursive feature elimination with 5-fold random forest cross-validation was applied for further feature selection. SVM and KNN models were then trained, with a 7:3 ratio of training set to validation set. Finally, the area under the multi-class ROC curve (AUC) was calculated using the macro-average method. All data processing and analysis were performed in Python 3.7.

Results: A total of 5 data sets were collected: GSE10714, GSE37364, GSE41657, GSE50114, and GSE50115, among which GSE50114 and GSE50115 are different data sets from the same experiment. GSE10714 contains 7 cancer, 5 adenoma, and 3 normal samples; GSE37364 contains 27 cancer, 29 adenoma, and 38 normal samples; GSE41657 contains 25 cancer, 51 adenoma, and 12 normal samples; GSE50114 and GSE50115 together contain 9 cancer, 37 adenoma, and 9 normal samples. The five data sets share 9827 common genes. Filtering by maximal information coefficient selected 55 genes, and the subsequent cross-validated recursive feature elimination retained 51 genes: ACAT1, ADAMDEC1, ADH1C, AHCYL2, AJUBA, APPL2, C1QC, C5orf30, CA2, CASP7, CDH3, CHGA, CHP2, CLDN1, COL1A1, CXCL3, DHRS11, FBLIM1, GDF15, GLA, GLTP, GNA11, GNA13, GTF2IRD1, HPGD, HSD11B2, ISX, MAOA, MMP7, MPEG1, NEBL, NFE2L3, NR3C2, PHF19, PHLDA1, PPAP2A, PXMP2, RNF43, S100A2, SLC29A1, SLCO4A1, SMPDL3A, SORD, SPPL2A, STAP2, STX12, SULT1A1, TNS4, TPD52L2, TSPAN7, and UGP2. With the selected feature genes, the linear SVM achieved a macro-average AUC of 0.9710 (0.9857, 0.9632, and 0.9412 for normal, adenoma, and cancer samples, respectively); with the original data without feature selection, the SVM achieved a macro-average AUC of 0.9627 (0.9823, 0.9389, and 0.9389, respectively). The difference between these two macro-average AUCs was statistically significant (P < 0.05). With the selected feature genes, the KNN model achieved a macro-average AUC of 0.9555 (0.9895, 0.9319, and 0.8998 for normal, adenoma, and cancer samples, respectively); with the original data without feature selection, it achieved a macro-average AUC of 0.9496 (0.9895, 0.9191, and 0.8773, respectively). This difference was also statistically significant (P < 0.05). Between the SVM and the KNN model trained on the selected feature genes, the difference in macro-average AUC was not statistically significant (P > 0.05).

Conclusion: Multi-class ROC is well suited to evaluating multi-classification tasks; combining the maximal information coefficient with recursive feature elimination for feature selection can improve the performance of machine learning models; and SVM and KNN models achieve good classification results on multi-class data.
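The filter-then-wrapper pipeline described in Methods can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the MIC filter (which the paper computes per gene with a 0.4817 cut-off, e.g. via the minepy package) is replaced here by an absolute-correlation filter so the sketch needs only NumPy and scikit-learn, and all model parameters and the filter quantile are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the merged expression matrix (samples x genes),
# with 3 classes playing the role of normal / adenoma / cancer.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Step 1: univariate filter. The paper scores each gene with the maximal
# information coefficient and keeps genes with MIC >= 0.4817; absolute
# Pearson correlation and a median cut-off stand in for that here.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
X_filt = X[:, scores >= np.quantile(scores, 0.5)]

# Step 2: recursive feature elimination with 5-fold cross-validation,
# using a random forest as the ranking estimator.
rfecv = RFECV(RandomForestClassifier(n_estimators=100, random_state=0), cv=5)
X_sel = rfecv.fit_transform(X_filt, y)

# Step 3: 7:3 split into training and validation sets.
X_tr, X_va, y_tr, y_va = train_test_split(X_sel, y, test_size=0.3,
                                          random_state=0, stratify=y)

# Step 4: train a linear SVM and a KNN model, then compute the
# macro-averaged multi-class (one-vs-rest) ROC AUC for each.
for clf in (SVC(kernel="linear", probability=True, random_state=0),
            KNeighborsClassifier()):
    proba = clf.fit(X_tr, y_tr).predict_proba(X_va)
    auc = roc_auc_score(y_va, proba, multi_class="ovr", average="macro")
    print(type(clf).__name__, round(auc, 4))
```

The per-class areas reported in Results correspond to the individual one-vs-rest curves that the macro average is taken over; `roc_auc_score` with `average=None` would return them separately.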