| In early 2017, the National Cancer Center published China’s latest cancer data, which show that the number of new cancer cases in China is rising and that the situation remains severe. DNA microarray technology can produce large numbers of gene expression profiles, which provide a reliable data source for tumor feature gene selection and tumor subtype classification. However, the gene expression profiles obtained by this technique contain noise and redundant genes, which can reduce the accuracy of tumor subtype classification. Tumor feature gene selection not only efficiently selects highly relevant genes but also reduces the cost of tumor subtype classification. Gene expression profiles are high-dimensional, small-sample, and noisy, which poses great challenges for data analysis and processing. Based on gene expression profiles obtained by DNA microarray technology, this paper uses machine learning to explore tumor feature gene selection methods with high generalization ability and classification accuracy. The main contents of this paper are as follows: (1) A tumor feature gene selection method based on PCA and Information Gain. The traditional principal component analysis (PCA) algorithm does not take the category information of the sample data into account, so the genetic data cannot be used effectively and the selected feature gene set still contains redundant information, which leads to low classification accuracy. To address this, a tumor feature gene selection method based on PCA and Information Gain is proposed. First, PCA is used to reduce the dimensionality of the original gene data set, and the genes with a high contribution rate are selected. Then the Information Gain algorithm is used to eliminate redundant information from the preselected feature gene subset, and the final gene subset is constructed according to the Information Gain value of each preselected feature gene. The experimental results show that this method can select tumor feature genes
quickly and efficiently, and that the classification performance meets expectations. (2) A tumor feature gene selection method based on Information Gain and Neighborhood Rough Set. Gene expression profiles contain many redundant genes, and classification results can be affected by noise during data processing, so some existing tumor feature gene selection methods have weak classification ability and poor robustness. To address this, a tumor feature gene selection method based on Information Gain and Neighborhood Rough Set is proposed. First, the Information Gain algorithm is used to calculate the Information Gain value of each gene. After the genes are sorted in descending order, the gene with the greatest Information Gain value is selected. The genes most strongly correlated with this gene, as measured by the Spearman correlation coefficient, are used as the preselected feature gene subset. Then the Neighborhood Rough Set is used to reduce the preselected feature gene subset, and a sequential forward search algorithm is used to select the more important genes. Experiments show that this method achieves higher classification accuracy than other related methods and yields a smaller feature gene subset. |
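
The preselection stage of method (2) can be illustrated with a minimal sketch. It is not the paper's implementation: the median-split discretization, the helper names (`information_gain`, `spearman`, `preselect`), the parameter `k`, and the toy data are all assumptions made for illustration. It ranks genes by Information Gain, takes the top gene, and keeps the genes most Spearman-correlated with it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one gene after a median split (assumed discretization)."""
    ordered = sorted(values)
    n = len(ordered)
    median = ordered[n // 2] if n % 2 else (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    low = [y for v, y in zip(values, labels) if v <= median]
    high = [y for v, y in zip(values, labels) if v > median]
    conditional = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - conditional


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


def preselect(genes, labels, k):
    """Return the top-IG gene, that gene plus its k most correlated genes, and all gains."""
    gains = {name: information_gain(vals, labels) for name, vals in genes.items()}
    best = max(gains, key=gains.get)
    others = sorted(
        (g for g in genes if g != best),
        key=lambda g: abs(spearman(genes[g], genes[best])),
        reverse=True,
    )
    return best, [best] + others[:k], gains


# Toy data (hypothetical): four genes measured on six samples.
genes = {
    "g1": [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],  # separates the two classes
    "g2": [0.2, 0.3, 0.5, 0.4, 1.9, 2.4],  # strongly rank-correlated with g1
    "g3": [0.5, 0.1, 0.9, 0.2, 0.8, 0.3],  # noise
    "g4": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # constant, carries no information
}
labels = ["A", "A", "A", "B", "B", "B"]
best, subset, gains = preselect(genes, labels, k=1)
print(best, subset)
```

Ranking by absolute Spearman correlation is one reading of "greatest correlation"; a signed ranking would be an equally plausible choice.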
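
The reduction stage of method (2), a Neighborhood Rough Set combined with sequential forward search, might look roughly like the sketch below. It follows the common textbook formulation, in which a sample belongs to the positive region when every sample within distance `delta` shares its class label, and the dependency is the fraction of such samples. The Euclidean distance, the value of `delta`, the stopping threshold `eps`, and the toy data are assumptions, not details taken from the paper.

```python
import math


def dependency(samples, labels, attrs, delta):
    """Fraction of samples whose delta-neighborhood over attrs is label-pure."""
    if not attrs:
        return 0.0
    consistent = 0
    for i, x in enumerate(samples):
        pure = True
        for j, y in enumerate(samples):
            dist = math.sqrt(sum((x[a] - y[a]) ** 2 for a in attrs))
            if dist <= delta and labels[i] != labels[j]:
                pure = False  # a neighbor from another class breaks purity
                break
        consistent += int(pure)
    return consistent / len(samples)


def forward_select(samples, labels, candidates, delta, eps=1e-9):
    """Sequential forward search: greedily add the gene with the largest dependency gain."""
    selected, current = [], 0.0
    remaining = list(candidates)
    while remaining:
        scores = {a: dependency(samples, labels, selected + [a], delta) for a in remaining}
        best = max(remaining, key=scores.get)
        if scores[best] - current <= eps:
            break  # no remaining gene enlarges the positive region
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected, current


# Toy data (hypothetical): six samples, three genes scaled to [0, 1].
samples = [
    [0.10, 0.50, 0.20],
    [0.15, 0.90, 0.80],
    [0.20, 0.10, 0.25],
    [0.80, 0.55, 0.75],
    [0.85, 0.95, 0.30],
    [0.90, 0.15, 0.70],
]
labels = ["A", "A", "A", "B", "B", "B"]
selected, current = forward_select(samples, labels, candidates=[0, 1, 2], delta=0.2)
print(selected, current)
```

In this toy example only gene 0 separates the classes, so the search stops after one step. On real expression data, `delta` would need tuning, since larger neighborhoods tolerate more noise but shrink the positive region.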