| With the continued development of biomedicine, DNA microarray and sequencing technologies have advanced considerably, and large amounts of cancer gene expression data have been generated. Analyzing cancer gene expression profile data and mining the important characteristics of samples for classification is of great significance for cancer diagnosis, treatment, pathology, prognosis, and research. Cancer gene expression profile data usually contain expression information for thousands of genes, but only a small number of those genes are relevant to sample classification. Moreover, such data are characterized by high noise, errors, and small sample sizes, and some gene expression profile data contain multi-level information, which complicates sample classification and processing. Feature selection is an effective way to analyze gene expression profile data and to select informative feature genes for classifying cancer subtypes. It is therefore of both exploratory and practical value for cancer classification and clinical treatment to find, among thousands of genes, a feature selection model that is effective for sample classification or clustering, generalizes well, and makes full use of the information in cancer gene expression profile data. Based on cancer gene expression profile data at different levels, this thesis proposes two improved feature selection algorithms. The first is an improved feature selection algorithm for single gene expression profile data, aimed at improving classification accuracy. The second is a feature selection algorithm for cancer genomic data, which integrate gene expression, copy number variation, and methylation data as separate views. This multi-view approach takes the potential correlations among the multi-omics data into account and can thereby improve sample classification or clustering results. We then conduct 
a comparative experimental analysis of the two improved algorithms to verify their effectiveness. The main contributions are as follows: (1) To address the problem that the large number of irrelevant or weakly relevant features in cancer gene data degrades sample classification accuracy, a feature selection method for tumor data based on improved non-negative matrix factorization is proposed. By introducing a hypergraph regularization term to preserve the low-dimensional manifold structure of the data, and by incorporating label information and the L2,1-norm into the objective function, the method improves the discriminative ability of the factorized matrices, reduces sensitivity to noise and outliers, and enhances the robustness of the algorithm. Experimental results show that the method can effectively eliminate redundant or irrelevant features and improve the classification accuracy of cancer samples. (2) To address the problems that low-rank representation alone cannot capture the underlying local manifold structure of the data and that the consistency and complementarity among cancer genomic data are ignored, a multi-view feature selection method based on manifold regularization and low-rank representation is proposed for the integrated analysis of genomic data. By introducing graph regularization, the method makes full use of the local geometric structure of the data and improves the ability of the low-rank matrix to learn local feature information. In addition, a sparse symmetric constraint and a block constraint are introduced to balance the influence of noise on the low-rank representation algorithm across the multi-omics cancer data. Features are evaluated with a scoring function, and the selected feature subset is used for clustering. Finally, the proposed algorithm is applied to cancer genomic data to verify its effectiveness. The two improved algorithms proposed in this thesis were evaluated to verify the rationality and effectiveness of the 
proposed methods on multiple cancer data sets. The experimental results show that the proposed algorithms have clear advantages over existing methods, achieve better feature selection, and obtain better classification or clustering results. |
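
The first method summarized above builds on non-negative matrix factorization and the L2,1-norm. As a rough illustration of those two building blocks only, the following sketch runs plain NMF with the standard multiplicative updates and ranks genes by the norm of their rows in the basis matrix. It is not the thesis's algorithm: the hypergraph regularizer and label information are omitted, and the function names (`l21_norm`, `nmf`, `top_genes`), the genes-by-samples layout of `X`, and the row-norm scoring rule are all assumptions made for this example.

```python
import numpy as np


def l21_norm(M):
    """L2,1-norm: the sum of the Euclidean norms of the rows of M."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF, X ~= W @ H, via multiplicative updates (Frobenius loss).

    X is assumed to be a non-negative (genes x samples) matrix. This is an
    unregularized baseline: no hypergraph term, no label information, and
    no L2,1 loss.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, k)) + eps  # basis matrix, one row per gene
    H = rng.random((k, n_samples)) + eps  # coefficients, one column per sample
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def top_genes(W, m):
    """Assumed scoring rule: rank genes by the L2 norm of their row in W."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:m]


if __name__ == "__main__":
    # Synthetic data: 5 strongly expressed genes among 30, over 12 samples.
    rng = np.random.default_rng(0)
    X = 1e-3 * rng.random((30, 12))
    X[:5] = rng.random((5, 2)) @ rng.random((2, 12)) + 1.0
    W, H = nmf(X, k=2)
    print("selected genes:", sorted(top_genes(W, m=5).tolist()))
    print("L2,1-norm of residual:", l21_norm(X - W @ H))
```

Scoring by row norm is one common heuristic for turning a factorization into a feature ranking. A robust variant would replace the Frobenius reconstruction loss with the L2,1-norm of the residual, which is the quantity the demo prints.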