| MicroRNA-seq and single-cell RNA-seq (scRNA-seq) data have become an important basis for biological and medical research. Selecting characteristic factors from large volumes of gene expression data for classification has attracted extensive attention from researchers, and using microRNA-seq or scRNA-seq data to diagnose disease types is an effective approach in medical research. For such sequencing data, statistical classification methods include Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). Because the number of genes is in the thousands while the number of samples is only in the dozens, and not all genes contribute to classification, gene expression data contain many redundant and irrelevant genes. A typical way to handle such data is feature gene selection. Finding the genes that play a decisive role in sample classification is very important for the subsequent classification work: removing irrelevant genes and detecting important feature genes improves classification accuracy, saves computing time, and improves computing efficiency. At present, the BSS/WSS method (the ratio of between-class to within-class sums of squares) is widely used, but it assumes that the data are normally distributed, so it may not be suitable for microRNA-seq and scRNA-seq data. To address these problems, this thesis proposes a method, ENTC, that encodes categories and selects differentially expressed genes using the Spearman correlation coefficient. A correlation coefficient reflects the direction and degree of the trend between two variables, and the Spearman correlation coefficient is a statistical measure of the strength of the monotonic relationship between paired data. We recode the class labels of the samples according to the magnitude of the sample observations in each class, obtaining a new class code for each class. We then compute the correlation coefficient between each gene and the new class codes and select the genes with the larger coefficients, so that the selected genes vary little within a class and differ strongly between classes, which improves the efficiency and accuracy of classification. We also prove the sure screening property and the ranking consistency of the proposed ENTC method. We compare the ENTC method with existing feature gene selection methods. Simulations show that, in many cases, ENTC selects feature genes more accurately than the other methods and yields a lower misclassification rate. In addition, an analysis of real data also shows that the ENTC method performs better than the other existing methods. |
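
The recode-then-correlate idea described above can be illustrated with a minimal sketch. It is not the thesis's ENTC implementation: the abstract does not specify the exact recoding rule, so the sketch assumes that, for each gene, classes are ordered by their mean expression and each sample receives the position of its class in that order. The function names (`average_ranks`, `spearman`, `recode_classes`, `select_genes`) and the toy counts are hypothetical and chosen only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no ordering information
    return cov / (var_x * var_y) ** 0.5


def recode_classes(expression, labels):
    """ASSUMED recoding rule: order classes by the gene's mean expression and
    give every sample the 1-based position of its class in that order."""
    classes = sorted(set(labels))
    class_mean = {
        c: mean(v for v, lab in zip(expression, labels) if lab == c)
        for c in classes
    }
    ordered = sorted(classes, key=lambda c: class_mean[c])
    new_code = {c: pos for pos, c in enumerate(ordered, start=1)}
    return [new_code[lab] for lab in labels]


def select_genes(counts, labels, top_k):
    """Score each gene by its Spearman correlation with the recoded classes
    and return the top_k gene names together with all scores."""
    scores = {}
    for gene, expression in counts.items():
        codes = recode_classes(expression, labels)
        scores[gene] = spearman(expression, codes)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Toy data (made up): three classes, three samples each.
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
counts = {
    "gene_sep": [0, 1, 2, 10, 12, 11, 30, 28, 35],  # classes well separated
    "gene_noise": [5, 0, 9, 1, 8, 4, 7, 3, 6],      # no class structure
    "gene_flat": [4, 4, 4, 4, 4, 4, 4, 4, 4],       # constant expression
}
selected, scores = select_genes(counts, labels, top_k=1)
print(selected, scores)
```

Because the score depends only on ranks, it makes no normality assumption about the counts, which is the motivation the abstract gives for preferring a Spearman-based criterion over BSS/WSS. The sketch is written in plain Python for readability; for real expression matrices a vectorized routine would be preferable.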