| In recent years, single-cell sequencing technology has been widely used across many fields of biology. With single-cell sequencing, a single cell can be represented as a high-dimensional vector, and thousands of cells form a high-dimensional matrix. Analyzing this matrix makes it possible to explore differences in chromatin accessibility, gene expression, and other features between different cell types, between cells of the same type in different tissues, and between patient cells and normal human cells, and thereby to provide an important theoretical basis for research on disease diagnosis, cell differentiation, and related topics. As single-cell sequencing technologies generate more and more data, biologists' need for data analysis algorithms grows as well. Considering the effectiveness of low-rank matrix factorization and sparse optimization in data mining, this thesis focuses on how to use these two techniques to mine useful information from single-cell sequencing data. Single-cell RNA sequencing (scRNA-seq) is a sequencing technology that measures gene expression in single cells. As a fundamental step in the analysis of scRNA-seq data (the gene expression matrix), cluster analysis determines, to a certain extent, the accuracy of subsequent analyses. Several algorithms for clustering scRNA-seq data already exist, but their clustering accuracy is still limited, so there is an urgent need to improve existing algorithms or develop new ones. To address this problem, in Chapter 3 we propose scSO, a cluster analysis algorithm for scRNA-seq data based on low-rank matrix factorization and sparse optimization. In scSO, under the assumption that the expression levels of cells of the same type are approximately linearly correlated, we exploit the non-negativity and sparsity of scRNA-seq data to propose a dimensionality reduction method based on non-negative matrix factorization, and then apply
Gaussian mixture models and Bayes' formula to measure the similarity between cells, and finally propose a spectral method based on sparse optimization to determine the final cell clusters. To compute the sparse non-negative matrix factorization of the gene expression matrix quickly and with low memory consumption, we improve the existing alternating iterative algorithm and analyze the convergence of the new algorithm. For the spectral analysis, we prove that the eigenvalues of the Laplacian matrices of similarity matrices are piecewise linearly distributed, and we propose a sparse-optimization-based algorithm for estimating the multiplicity of the zero eigenvalue of these Laplacian matrices. Tests on multiple benchmark datasets validate the performance of scSO. Compared with current state-of-the-art and widely used algorithms (Seurat, SC3, etc.), the number of cell clusters predicted by scSO is the closest to the reference value, and most cells are correctly classified. Despite the rapid increase in the application of scRNA-seq technology, the number of genes detected per cell is still limited by technical challenges. As a result, a gene that is expressed in a cell may nevertheless go undetected in that cell. This phenomenon is called a dropout event. Dropout events introduce many zero entries into the gene expression matrix that do not represent the true expression levels of genes in the cell, and they seriously affect the analysis of scRNA-seq data (such as cluster analysis). In this context, Chapter 4 of this thesis investigates how to efficiently recover missing information from a highly sparse gene expression matrix. Considering that the zero and non-zero elements of the gene expression matrix carry different amounts of information, we propose a biased low-rank matrix decomposition method, WEDGE (WEighted Decomposition of Gene Expression), to impute the lost information in the gene expression matrix. In order to
solve the low-rank matrix factorization problem for the gene expression matrix quickly and efficiently, we propose an alternating iterative algorithm and prove that the algorithm converges under certain conditions and that its time complexity is linear in the number of cells. Experiments show that WEDGE can successfully impute gene expression, reproduce cell-cell and gene-gene correlations, and improve cell clustering. The regulation of chromatin structure and gene expression underlies key developmental transitions in cell lineages. With the development of single-cell sequencing, several single-cell multi-omics sequencing technologies have emerged in recent years that can simultaneously profile chromatin accessibility and RNA expression in a single cell. These techniques provide a reliable basis for studying the regulatory relationship between chromatin accessibility and gene expression. Although several methods have been developed to identify cis-regulatory elements from single-cell multi-omics data containing chromatin accessibility and RNA expression, these methods are not applicable to single-cell multi-omics data of relatively poor quality. To provide biologists with a cis-regulatory element identification method applicable to single-cell multi-omics data of varying quality, in Chapter 5, building on related research on DORC, gene score, chromatin co-accessibility, etc., we propose STARRY (Single cell to Accessibility Regulation Related Yield), a method based on sparse optimization and non-negative least squares fitting that identifies cis-regulatory elements from single-cell multi-modal data. Experiments show that STARRY outperforms existing methods in accuracy on both high-quality and poor-quality multi-modal data. In general, the research in this thesis aims to solve core problems in single-cell data analysis using sparse optimization and low-rank matrix approximation, focusing on scRNA-seq data
clustering, imputation of missing elements in sparse scRNA-seq data, identification of cis-regulatory elements, and other fundamental problems in single-cell data analysis. In solving each specific problem, we start from the characteristics of single-cell data and the needs of biologists, and provide a reasonable mathematical model and an effective solution algorithm. In addition, the research in this thesis provides a basis for research on refined disease diagnosis, cell differentiation, and related areas. |
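
The idea behind the weighted decomposition described for WEDGE, that zero entries of the gene expression matrix carry less information than non-zero entries, can be illustrated with a small sketch. This is not the WEDGE algorithm itself: the abstract does not give its objective or update rules, so the sketch below swaps in standard multiplicative updates for weighted non-negative matrix factorization. The function name `weighted_nmf`, the parameter `zero_weight`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def weighted_nmf(X, rank, zero_weight=0.1, n_iter=200, eps=1e-10, seed=0):
    """Illustrative weighted NMF (hypothetical helper, not the thesis's WEDGE).

    Minimizes sum_ij M_ij * (X_ij - (W @ H)_ij)**2 with W, H >= 0, where
    M_ij = 1 for non-zero entries of X and M_ij = zero_weight for zeros,
    so that possible dropout zeros constrain the fit less strongly.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    M = np.where(X > 0, 1.0, zero_weight)  # per-entry weights (assumed scheme)
    W = rng.random((n_genes, rank)) + eps  # non-negative initialization
    H = rng.random((rank, n_cells)) + eps
    MX = M * X
    losses = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
        losses.append(float(np.sum(M * (X - W @ H) ** 2)))
    return W, H, losses


# Synthetic example: a rank-3 non-negative matrix with simulated dropout.
rng = np.random.default_rng(1)
truth = rng.random((50, 3)) @ rng.random((3, 40))
observed = truth * (rng.random(truth.shape) > 0.5)  # roughly half the entries zeroed
W, H, losses = weighted_nmf(observed, rank=3)
imputed = W @ H  # dense low-rank estimate of the expression matrix
```

Setting `zero_weight` to 1 recovers ordinary NMF, while smaller values let the low-rank model fill in zeros from the non-zero entries. How the weight should be chosen for real scRNA-seq data is a modeling decision that this sketch does not address.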