| In recent years, the diagnosis and treatment of tumors have moved beyond the level of surface symptoms, and researchers are eager to explore the underlying genetic causes of tumor formation and metastasis. With the widespread application of microarray and high-throughput sequencing technologies, more and more genomic data have accumulated in the field of tumor research, which has promoted the development of precision medicine. One such development is the use of feature selection techniques to mine biomarkers (such as informative genes) in order to further analyze cancer pathology and develop targeted drugs. This article focuses on feature selection for tumor genomic data. Building on existing feature selection algorithms and research, it proposes several new feature selection algorithms, strategies and applications for tumor genomic data: (1) Primary selection of informative genes based on the t-test and fold-change analysis. Combining the t-test with fold-change analysis to identify differentially expressed genes is a very common way of analyzing gene expression data. Traditional methods set thresholds on the t-test p-value and on the fold change, and then take the union of the genes that meet the thresholds as the differentially expressed genes. This article improves the selection conditions: a distance formula is designed according to the characteristics of tumor genomic data and the selection requirements for differentially expressed genes, and the genes that meet the thresholds are further ranked and screened using this distance. In addition, informative genes were preliminarily selected from three gene expression data sets, yielding a number of up-regulated and down-regulated differentially expressed genes. (2) Selection of informative genes based on a genetic algorithm. For data sets that still contain a large number of informative genes after primary selection, a further round of gene selection is required to obtain a smaller set. In this paper, a genetic algorithm whose fitness function is a linear combination of the posterior probability and the empirical error rate of a linear classifier searches the initially selected up-regulated and down-regulated differentially expressed genes for an informative gene subset that maximizes the separability of the two classes. The parameters are tuned according to the characteristics of the tumor genomic data to obtain informative gene subsets of a given size. (3) HBSA-NRS, a heuristic breadth-first search feature selection algorithm based on improved neighborhood rough sets. The heuristic breadth-first search algorithm (HBSA) is widely used in feature selection, but it is very time-consuming because of the large number of nodes involved in the computation. To solve this problem, we propose a heuristic breadth-first search algorithm based on improved neighborhood rough sets (HBSA-NRS). When HBSA-NRS expands a node of the search tree, it computes the importance of every candidate feature according to neighborhood rough set theory and keeps only the features whose importance exceeds a given threshold as child nodes. It then uses an SVM to compute the classification accuracy of the feature subset represented by the path to each node in the current layer, takes this accuracy as heuristic information, ranks the nodes in descending order of accuracy, and selects the top several as the parent nodes to be expanded next. Together, these two pruning steps greatly reduce the number of nodes in the search tree and therefore the running time of the algorithm. |
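
The two pruning steps described for HBSA-NRS in item (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hbsa_nrs_sketch`, its parameters, and the two toy scoring functions are hypothetical. In the actual method the `significance` callback would be the neighborhood rough set feature importance and the `accuracy` callback would be an SVM classification accuracy; here both are replaced by simple arithmetic stand-ins so that the example runs without any data or external library.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Subset = FrozenSet[int]  # a feature subset, identified by feature indices


def hbsa_nrs_sketch(
    n_features: int,
    significance: Callable[[Subset, int], float],  # stand-in for NRS importance
    accuracy: Callable[[Subset], float],           # stand-in for SVM accuracy
    sig_threshold: float,
    beam_width: int,
    max_depth: int,
) -> Tuple[Subset, float]:
    """Layer-by-layer search that prunes by importance, then by accuracy."""
    frontier: List[Subset] = [frozenset()]  # parents to expand in this layer
    best_subset: Subset = frozenset()
    best_score = float("-inf")

    for _ in range(max_depth):
        children: Set[Subset] = set()
        for parent in frontier:
            for feature in range(n_features):
                if feature in parent:
                    continue
                # Pruning step 1: only features whose importance exceeds
                # the threshold become child nodes.
                if significance(parent, feature) > sig_threshold:
                    children.add(parent | {feature})
        if not children:
            break

        # Pruning step 2: score every child, rank in descending order and
        # keep only the top `beam_width` nodes as the next parents.
        scores: Dict[Subset, float] = {c: accuracy(c) for c in children}
        ranked = sorted(children, key=lambda c: (-scores[c], sorted(c)))
        frontier = ranked[:beam_width]

        if scores[frontier[0]] > best_score:
            best_subset, best_score = frontier[0], scores[frontier[0]]

    return best_subset, best_score


# Toy stand-ins (assumptions for illustration only): each feature has a fixed
# relevance weight, and "accuracy" rewards relevance but penalizes subset size.
WEIGHTS = [0.9, 0.1, 0.8, 0.05, 0.6]


def toy_significance(parent: Subset, feature: int) -> float:
    return WEIGHTS[feature]


def toy_accuracy(subset: Subset) -> float:
    return sum(WEIGHTS[i] for i in subset) - 0.7 * len(subset)


best, score = hbsa_nrs_sketch(
    n_features=5,
    significance=toy_significance,
    accuracy=toy_accuracy,
    sig_threshold=0.5,
    beam_width=2,
    max_depth=3,
)
print(sorted(best), round(score, 3))  # prints [0, 2] 0.3
```

In this toy run the importance threshold removes features 1 and 3 before any subset is scored, and the beam of width 2 discards the weakest subsets in each layer, so far fewer nodes are evaluated than in an exhaustive breadth-first search over all five features.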