
Application Of P-norm-based Sparse Models In The Analysis Of Cancer Sequencing Data

Posted on: 2022-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Song
GTID: 2514306323484894
Subject: Master of Engineering
Abstract/Summary:
The development of next-generation gene sequencing technology has produced a wide variety of cancer sequencing data. Making full use of these data to study the relationship between cancer and genes at the molecular level is critical for cancer diagnosis and treatment. Cancer sequencing data are typically "high-dimensional, small-sample": they contain substantial noise and many outliers, yet only a few genes are involved in cancer lesions. Sparse representation theory and methods therefore play an important role in analyzing such data. As research deepens, new requirements arise for the analysis methods: (1) how to improve the sparsity-inducing ability of the model so that the selected differentially expressed genes are more biologically meaningful; (2) how to reduce the model's sensitivity to outliers and enhance its robustness; (3) how to reduce the dimension of nonlinear data effectively without losing important information; and (4) how to improve the generalization performance of the model. It is therefore important to explore sparse modeling methods better suited to this kind of data. To address these problems, and building on previous studies, this thesis proposes three improved sparse models based on the p-norm and applies them to cancer sequencing data. The results show that the robustness, sparsity, and generalization ability of the models are improved. The work consists of the following three parts:

(1) A graph-regularized sparse model (PL21GPCA) based on nonconvex Lp-norm and L2,1-norm constraints is proposed. The model replaces the traditional Frobenius norm in the error function with the nonconvex Lp-norm to reduce the influence of noise and outliers in the data. The L2,1-norm is then used to improve the sparse representation of differentially expressed genes. Finally, graph regularization preserves the intrinsic geometric structure of the data. Clustering results on a lung cancer dataset and on cancer gene expression datasets verify the effectiveness of the method. In addition, the method can identify some cancer-related pathogenic genes by discovering gene network modules.

(2) Two dual-graph-regularized sparse models based on the nonconvex Lp-norm and the L2,p-norm are proposed: DGPPCA and DG2PPCA. To improve sparsity, DGPPCA imposes L2,p-norm constraints on the projection matrix, so that it can adapt to different datasets as p varies in the range (0,1). Dual graph regularization takes the original manifold structures of both genes and samples into account. As an extension of DGPPCA, DG2PPCA introduces the nonconvex Lp-norm into the error function to improve robustness. Both methods were applied to single-cell RNA sequencing datasets for bi-clustering analysis. The experimental results show that both methods can find the "checkerboard" structure of bi-clustering and perform well in sample clustering and gene clustering.

(3) A robust sparse model based on the weighted Schatten p-norm and the L2,p-norm (L2,p-WSRPCA) is proposed. The model uses weighted Schatten p-norm and L2,p-norm constraints to improve on traditional robust principal component analysis. First, the weighted Schatten p-norm is applied to low-rank matrix recovery; to improve the recovery, different singular values are shrunk to different degrees. The flexibility in the choice of p can enhance the generalization ability of the model. Then, exploiting the row sparsity induced by the L2,p-norm, the noise matrix is sparsely constrained to obtain sparser solutions. Finally, comparative experiments on sample clustering and feature selection are performed on single-cell RNA sequencing datasets to verify the performance of L2,p-WSRPCA.
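
The norms named in the abstract can be made concrete with a short NumPy sketch. The abstract does not state the exact formulas used in the thesis, so the definitions below follow the forms commonly found in the sparse-modeling literature and should be read as assumptions: the elementwise Lp penalty is taken as the sum of |x_ij|^p, the L2,1- and L2,p-norms are computed over rows, and the weighted Schatten p-norm is taken as the weighted sum of the p-th powers of the singular values. The function names and the weight vector `w` are hypothetical and not taken from the thesis.

```python
import numpy as np


def lp_norm_p(X, p):
    """Assumed form: sum of |x_ij|^p over all entries (p-th power of the Lp quasi-norm)."""
    return float(np.sum(np.abs(X) ** p))


def l21_norm(X):
    """Sum of the Euclidean norms of the rows of X."""
    return float(np.sum(np.linalg.norm(X, axis=1)))


def l2p_norm_p(X, p):
    """Assumed form: sum over rows of (row Euclidean norm)^p."""
    return float(np.sum(np.linalg.norm(X, axis=1) ** p))


def weighted_schatten_p(X, w, p):
    """Assumed form: sum_i w_i * sigma_i^p, with sigma_i the singular values of X.

    `w` is a hypothetical weight vector with one entry per singular value.
    """
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(np.asarray(w, dtype=float) * s ** p))


if __name__ == "__main__":
    # Toy matrix with one nonzero row, i.e. a row-sparse matrix.
    X = np.array([[3.0, 4.0], [0.0, 0.0]])
    print("Lp^p (p=0.5):       ", lp_norm_p(X, 0.5))
    print("L2,1:               ", l21_norm(X))
    print("L2,p^p (p=0.5):     ", l2p_norm_p(X, 0.5))
    print("Schatten (w=1, p=1):", weighted_schatten_p(X, [1.0, 1.0], 1.0))
```

Under these assumed definitions, the robustness argument is easy to see: a single residual entry of magnitude 100 contributes 10000 to a squared Frobenius loss but only 10 to the Lp penalty with p = 0.5, so outliers weigh far less on the fit. With all weights equal to 1 and p = 1, the weighted Schatten form reduces to the nuclear norm used in standard robust principal component analysis.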
Keywords/Search Tags: Sparse model, Nonconvex Lp-norm, L2,p-norm, Weighted Schatten p-norm, Cancer sequencing data