With the development of various sequencing technologies, a large amount of gene expression data has been generated, and traditional biological methods for analyzing such data can no longer meet the demand. In recent years, researchers have introduced machine learning theories and methods into bioinformatics to discover important biological information through the comprehensive analysis of gene expression data. Focusing on the characteristics of tumor gene expression data and taking machine learning as the starting point, this paper proposes a series of algorithms for analyzing tumor gene expression data, addressing characteristic gene selection, tumor sample classification, and tumor clustering. The main research contents are as follows:

1. Tumor characteristic gene selection based on deep learning and matrix decomposition. First, because deep learning models cannot directly select tumor characteristic genes, we propose a sample learning-based deep sparse filtering method for tumor characteristic gene selection. Second, based on the optimal mean algorithm and block optimization theory, we propose an optimal mean-based block robust characteristic gene selection method to analyze the integrated data in TCGA. Finally, by using the scatter matrix to add class label information to the unsupervised algorithm, we propose a supervised penalty matrix decomposition algorithm for characteristic gene selection.

2. Tumor sample classification based on sample expansion and deep learning. To address the severe shortage of training samples when deep learning models are used for tumor sample classification, we propose a sample expansion method based on the denoising autoencoder to obtain a large number of auxiliary samples. Furthermore, by combining the sample expansion method with two deep learning models, we design a sample expansion-based stacked autoencoder model and a sample expansion-based 1D convolutional neural network model for tumor sample classification.

3. Tumor sample clustering based on low-rank subspace segmentation. To cluster tumor gene expression data, traditional subspace segmentation methods must rely on spectral clustering. To address this problem, we propose two low-rank subspace clustering methods for tumor samples that use discrete constraints to learn the sample labels of the subspace directly. First, considering the manifold structure within tumor gene expression data, we propose a low-rank subspace clustering algorithm based on discrete constraint and hypergraph regularization. Second, to reduce the influence of outliers in tumor data and improve robustness, we propose a robust low-rank subspace clustering algorithm based on discrete constraint and capped norm.

4. Biclustering of tumor data based on dual hypergraph regularization principal component analysis. To account for the sample manifold structure and the gene manifold structure in tumor data simultaneously, a sample hypergraph and a gene hypergraph are constructed to capture the local geometric information of the data, and the dual hypergraph is used as the regularizer of principal component analysis for sample clustering and gene clustering. On this basis, we propose a dual hypergraph regularization principal component analysis algorithm for biclustering tumor gene expression data. Experimental results on multiple tumor gene expression datasets verify the effectiveness and superiority of the proposed algorithm.
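
The hypergraph regularizers mentioned in items 3 and 4 rely on a hypergraph Laplacian. As a rough illustration only, the sketch below builds a normalized hypergraph Laplacian in the commonly used form L = I - Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2). It assumes one hyperedge per sample, made of that sample and its k nearest neighbors, with unit hyperedge weights. These construction choices, the function name `knn_hypergraph_laplacian`, and the parameter `k` are assumptions made for illustration and are not taken from the paper. The code is written in plain Python to stay self-contained.

```python
import math


def knn_hypergraph_laplacian(X, k=3):
    """Normalized hypergraph Laplacian for samples X (list of feature lists).

    Assumption (not from the paper): each sample defines one hyperedge
    containing itself and its k nearest neighbours under Euclidean distance,
    and every hyperedge has weight 1.
    """
    n = len(X)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of samples")

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Incidence matrix: H[v][e] = 1 if vertex v belongs to hyperedge e.
    H = [[0.0] * n for _ in range(n)]
    for e in range(n):
        others = sorted((j for j in range(n) if j != e),
                        key=lambda j: sqdist(X[e], X[j]))
        for v in [e] + others[:k]:
            H[v][e] = 1.0

    w = [1.0] * n                                            # hyperedge weights
    dv = [sum(w[e] * H[v][e] for e in range(n)) for v in range(n)]  # vertex degrees
    de = [sum(H[v][e] for v in range(n)) for e in range(n)]         # hyperedge degrees

    # L = I - Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2)
    L = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            theta = sum(H[u][e] * w[e] / de[e] * H[v][e] for e in range(n))
            theta /= math.sqrt(dv[u] * dv[v])
            L[u][v] = (1.0 if u == v else 0.0) - theta
    return L
```

Calling `knn_hypergraph_laplacian(samples, k)` on the rows of an expression matrix gives a sample hypergraph Laplacian, and calling it on the columns (the transposed matrix) gives a gene hypergraph Laplacian. A regularized objective would then typically add a trace term involving such a matrix, although the exact formulation used in the paper is not specified here.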