| With the development of The Cancer Genome Atlas (TCGA) program and The Human Cell Atlas (THCA) program, huge amounts of biomics data have been generated. These omics data contain important information about biological functions and gene regulation, and mining them can help in exploring the emergence, prevention, and treatment of diseases. Biomics data are characterized by small sample sizes and high dimensionality. As an effective dimensionality reduction technique, matrix factorization has attracted wide attention from researchers. However, when the data contain noise and outliers, or when the manifold structure of the data is ignored, the performance of traditional matrix factorization methods degrades easily. This paper aims to complement and improve the existing non-negative matrix factorization (NMF) and low-rank representation (LRR) methods and to apply them successfully to omics data. The specific research contents are as follows: (1) To address the high dimensionality and manifold structure of biomics data, a robust non-negative matrix factorization method based on graph regularization (GrRNMF) is proposed. This method imposes a graph regularization constraint to take into account the internal connections between data samples, making full use of the pairwise geometric information contained in the data. Then, Gaussian noise and sparse noise are modeled separately, so that sparse noise no longer degrades the dimensionality reduction performance. In addition, adding sparse constraints to the objective function makes the results more accurate. Finally, the GrRNMF method is applied to gene expression data for analysis and verification. (2) To address the complex connections among sample points in biomics data, a hypergraph regularized non-negative matrix factorization method based on the L2,1 norm (RHNMF) is proposed. This method imposes robustness and manifold constraints on NMF. When estimating residuals, an L2,1-norm constraint is used
so that the error function is no longer in the form of squared residuals, which suppresses the effects of noise and outliers. Then, by adding a hypergraph regularization constraint to the objective function, RHNMF can consider complex high-order relationships among multiple sample points. In other words, it digs deeper into the information contained in the data and improves the performance of the method. Finally, the RHNMF method is applied to integrated gene expression data for clustering and feature selection. (3) To address noise and outliers in biomics data, a novel method called correntropy-based hypergraph regularized non-negative matrix factorization (CHNMF) is proposed. Specifically, the correntropy measure is used instead of the Euclidean norm in the loss term of CHNMF, which improves the robustness of the method. A hypergraph regularization term is also applied to the objective function to explore high-order geometric information among multiple sample points. Then, the half-quadratic (HQ) optimization technique is adopted to solve the complex optimization problem of CHNMF. Finally, clustering, feature selection, and gene co-expression network analysis are performed on pan-cancer data, which can support systematic cancer research. (4) To address the insufficient flexibility of the NMF integration model in mining homogeneous information, a multi-view non-negative matrix factorization method based on graph regularization (GMvNMF) is proposed. The traditional NMF integration model is improved by decomposing it into a shared basis matrix, a subspace transformation matrix, and a shared coefficient matrix, which improves the flexibility of the model. Then, a graph regularization term is introduced into the objective function to make fuller use of the information in the data. Finally, GMvNMF is used to analyze different data types of the same cancer in TCGA. It makes full use of the complementary information between different data types, which
will further provide new ideas for cancer research at the molecular level. (5) To address the inability of existing single-cell analysis methods to accurately construct the cell-to-cell similarity matrix, a Laplacian regularized low-rank representation method based on Cauchy loss (CNLLRR) is proposed. First, the Cauchy loss function (CLF) is applied to penalize the noise matrix, which improves the robustness of CNLLRR to noise and outliers. To efficiently encode the local manifold information of the data, a graph regularization term is applied to the objective function. Together, these guarantee the quality of the learned cell-to-cell similarity matrix. Finally, CNLLRR is applied to single-cell datasets, which contributes to understanding the heterogeneity between cell populations in complex biological systems. Various experimental results show that the proposed methods effectively account for the manifold information or the noise and outliers in the data. They outperform other similar methods and achieve better clustering and feature selection results. |
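
For orientation, the graph regularization idea that several of the methods above build on can be sketched with the standard graph-regularized NMF (GNMF) multiplicative updates. This is a minimal sketch under stated assumptions, not the GrRNMF, RHNMF, CHNMF, or GMvNMF methods themselves: it omits the separate sparse-noise model, the L2,1 and correntropy losses, and the hypergraph construction. The function names `knn_graph`, `gnmf`, and `objective`, the binary k-nearest-neighbour affinity, and the parameters `lam`, `k`, and `n_iter` are illustrative choices that do not come from the abstract.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric binary k-NN affinity matrix over the columns (samples) of X."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2 * X.T @ X  # squared Euclidean distances
    np.fill_diagonal(dist, np.inf)                  # a sample is not its own neighbour
    idx = np.argsort(dist, axis=1)[:, :k]           # k nearest neighbours per sample
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                       # symmetrize

def objective(X, U, V, W, lam):
    """||X - U V^T||_F^2 + lam * Tr(V^T L V), with graph Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)

def gnmf(X, W, rank, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Standard GNMF multiplicative updates; X is features x samples, non-negative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, rank))      # basis matrix
    V = rng.random((n, rank))      # coefficient matrix (one row per sample)
    D = np.diag(W.sum(axis=1))     # degree matrix of the sample graph
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```

A usage example on synthetic non-negative data: `X = np.random.default_rng(1).random((20, 30)); W = knn_graph(X); U, V = gnmf(X, W, rank=3)`. The rows of `V` can then be clustered (for example with k-means) and the columns of `U` inspected for feature selection, which mirrors the kind of downstream analysis the abstract describes.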