Research On Robust Double-constrained Matrix Factorization Method And Its Application In Gene Sequencing Data

Posted on: 2022-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Wang
GTID: 2510306323986579
Subject: Computer application technology
Abstract/Summary:
The development of biological sequencing technology has made it possible to measure ever larger amounts of genomics data. These omics data capture subtle changes in individuals or cells, and mining them can help explore disease mechanisms, quantify differences between cells, and model living systems. Genomics data typically combine high dimensionality with small sample sizes, which poses a major obstacle to downstream analysis. As an effective dimensionality reduction method, matrix factorization has received extensive attention from scholars. Sequencing data inevitably contain noise and outliers, however, and the performance of traditional matrix factorization methods can degrade in their presence. This thesis improves the original Non-negative Matrix Factorization (NMF) model and Low-rank Representation (LRR) model and applies both to the dimensionality reduction of genomics data. The specific research contents are as follows:

(1) To address the non-Gaussian noise and inherent manifold structure of genomics data, a sparse robust non-negative matrix factorization method based on correntropy (SGNMFC) is proposed. This method replaces the traditional Euclidean distance with a correntropy-based function to improve the robustness of the algorithm. At the same time, the manifold structure of the data is encoded to capture spatial information between data points. Finally, the SGNMFC model is used for feature gene extraction and sample clustering on cancer data, providing further theoretical support for the systematic study of cancer.

(2) Considering that part of the genome data is contaminated by noise and contains many redundant features, a sparse robust non-negative matrix factorization method based on Huber loss (Huber-SGNMF) is proposed. This method replaces the Euclidean distance with Huber loss to improve the robustness of the algorithm. Huber loss automatically identifies whether the data are contaminated and applies an L1-norm or L2-norm constraint accordingly. Furthermore, an L2,1-norm constraint term is added on the feature matrix to remove a large number of redundant features and obtain a row-sparse matrix. Finally, the algorithm's clustering performance and its ability to extract differentially expressed genes are verified on an integrated dataset of multiple cancers, which helps in finding associations between those cancers.

(3) To address the blurred boundaries between different cell populations and the imprecise construction of cell similarity, a low-rank matrix factorization method based on adaptive total variation (ATV-LRR) is proposed. First, the method uses LRR to reconstruct the low-rank subspace structure of the original data and learns the similarity of cells within the subspace. Then, an adaptive total variation constraint is added to remove noise within data from the same cell type and to learn the boundary characteristics of different cell populations. Finally, the method is applied to single-cell data to learn the heterogeneity between cells and to partition cell populations.

(4) To address the problem that a single clustering model cannot adapt to different biological data, a low-rank matrix decomposition method based on a subspace ensemble framework (LRSEC) is proposed. This method takes an LRR model as the base learner. The LRR method maps cells of the same type into the same subspace and learns the pairwise similarity of cells. Then, an ensemble model is constructed over multiple individual learners to avoid the unstable performance of a single model across different datasets. Finally, the method is used to cluster cells in single-cell datasets and to extract gene markers, which helps in understanding the complex cell networks in biological systems.

The models proposed in this thesis have been applied to genomics data. Experimental results demonstrate that these methods are more robust to noise in the data and can effectively learn the manifold structure of the data. They not only achieve good clustering and feature selection performance but also outperform existing similar methods.
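
The abstract names three building blocks without giving their formulas: Huber loss, a correntropy-based loss, and the L2,1 norm. The following is a minimal sketch of the standard textbook forms of each, not the thesis's actual objective functions, which may differ. The function names and the parameters `delta` (Huber threshold) and `sigma` (Gaussian kernel width) are assumptions introduced only for illustration.

```python
import numpy as np


def huber_loss(residual, delta=1.0):
    """Elementwise Huber loss (textbook form; delta is an assumed threshold).

    Quadratic (L2-like) for |r| <= delta, linear (L1-like) beyond it,
    so large residuals from outliers are penalized less severely.
    """
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)


def correntropy_loss(residual, sigma=1.0):
    """Elementwise correntropy-induced loss with a Gaussian kernel.

    Equals 0 at r = 0 and saturates at 1 as |r| grows, which bounds
    the influence of any single outlier. sigma is an assumed kernel width.
    """
    return 1.0 - np.exp(-residual ** 2 / (2.0 * sigma ** 2))


def l21_norm(matrix):
    """L2,1 norm: sum of the Euclidean norms of the rows.

    Penalizing this norm pushes entire rows toward zero (row sparsity).
    """
    return np.sqrt((matrix ** 2).sum(axis=1)).sum()


if __name__ == "__main__":
    r = np.array([0.5, 3.0])
    print(huber_loss(r, delta=1.0))        # small residual: quadratic; large: linear
    print(correntropy_loss(r, sigma=1.0))  # bounded above by 1
    print(l21_norm(np.array([[3.0, 4.0], [0.0, 0.0]])))  # zero row adds nothing
```

In a robust NMF setting these losses would be applied to the reconstruction residual in place of the squared Euclidean error, with the L2,1 term acting as a regularizer on a factor matrix. How the thesis combines and optimizes them is not stated in the abstract.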
Keywords/Search Tags:Non-negative Matrix Factorization, Low-rank Representation, Correntropy, Huber Loss, Ensemble Learning