| Since the launch of the Human Genome Project (HGP), genomics has received extensive attention, gene sequencing technology in particular. With the emergence of Next Generation Sequencing (NGS) technology, and especially the rapid development of single-cell sequencing, a large amount of single-cell RNA sequencing (scRNA-seq) data has been generated. These data contain rich biological information that helps scientists explore cell differentiation, discover cell heterogeneity, and identify cell subtypes at the single-cell level. Owing to the limitations of sequencing technology, scRNA-seq data are usually high-dimensional, noisy, and sparse, which makes analysis difficult. Matrix factorization algorithms are commonly used dimensionality reduction methods and have been widely applied in many fields. However, because of these data characteristics and the limitations of the methods themselves, traditional matrix factorization algorithms perform poorly on scRNA-seq data. Therefore, this thesis improves and optimizes the traditional Non-negative Matrix Factorization (NMF) model and the Low-rank Representation (LRR) model for scRNA-seq data. The specific research contents are as follows: (1) To address the non-Gaussian noise and the higher-order spatial structure in scRNA-seq data, a Cauchy Robust Hyper-graph Laplacian Non-negative Matrix Factorization (CHLNMF) method is proposed. In this method, the Euclidean distance in the traditional NMF model is replaced by the Cauchy Loss Function (CLF) to reduce the model's sensitivity to noise. Hypergraph regularization is then added to the model to characterize the high-order spatial relationships among multiple samples and to learn the manifold structure among samples more fully. Finally, the method is applied to scRNA-seq data for sample clustering and gene marker selection. (2) To address the characteristics of scRNA-seq data, such as high noise, fuzzy boundaries between
different types of cell clusters, and local inherent manifold structure, a novel method called Robust Manifold Low-rank Representation with Adaptive Total-variation Regularization (MLRR-ATV) is proposed. In this method, the Adaptive Total-variation (ATV) model is first added to the LRR model to reduce noise interference among cells of the same type and to learn the boundary features between different cell types through the gradient. Then, the local inherent linear and nonlinear manifold structures in the data are learned by normalizing Euclidean distance and cosine similarity. Finally, the method is applied to scRNA-seq data to learn cell heterogeneity, cluster cells, and screen gene markers. (3) Because different single-cell clustering methods have different strengths owing to their different emphases, and because LRR methods are sensitive to the setting of initialization variables, a Robust Dual Ensemble Clustering Method Based on Low-Rank Representation (DELRR-sc) is proposed. Ensemble clustering can achieve higher performance by integrating the learning results of different methods. This method combines the advantages of four clustering methods based on the LRR model. In the two-layer integration framework, the first layer addresses the sensitivity of a single model to the setting of initialization variables. In the second layer, a scoring strategy based on the Silhouette Coefficient is designed to weight and integrate the learning results of the four methods. The final results are then used for cell clustering and gene marker selection, which helps to better explore the process of cell differentiation. The methods proposed in this thesis have been applied to scRNA-seq data. The experimental results show that they achieve better clustering results and overall performance than existing classical methods. |
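
To illustrate the motivation behind item (1), the following minimal Python sketch compares the squared Euclidean loss of standard NMF with a Cauchy-type loss on a few reconstruction residuals. The loss form ln(1 + (r/c)^2), the scale parameter `c`, the function names, and the toy residuals are all assumptions made for illustration; the abstract does not state the exact CLF used in CHLNMF.

```python
import math


def squared_loss(residuals):
    """Sum of squared residuals, as in the Euclidean objective of standard NMF."""
    return sum(r * r for r in residuals)


def cauchy_loss(residuals, c=1.0):
    """Sum of ln(1 + (r/c)^2), an assumed Cauchy-type loss with scale c.

    It behaves like a quadratic for small residuals but grows only
    logarithmically for large ones.
    """
    return sum(math.log(1.0 + (r / c) ** 2) for r in residuals)


# Hypothetical reconstruction residuals: four small ones, then one gross outlier.
clean = [0.1, -0.2, 0.05, 0.15]
noisy = clean + [10.0]

# How much does the single outlier add to each objective?
sq_increase = squared_loss(noisy) - squared_loss(clean)
cauchy_increase = cauchy_loss(noisy) - cauchy_loss(clean)

print(f"squared loss increase: {sq_increase:.3f}")
print(f"Cauchy loss increase:  {cauchy_increase:.3f}")
```

Under these assumptions the outlier contributes its full squared magnitude to the Euclidean objective but only a logarithmic term to the Cauchy-type objective, which is the sense in which replacing the Euclidean distance can reduce sensitivity to noise.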