Structured Sparse Methods And Their Application In Cancer Genomics

Posted on: 2018-11-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W W Min
GTID: 1364330542966598
Subject: Computer software and theory
Abstract/Summary:
With the development of high-throughput techniques, it has become easier to obtain multiple omics data for some cancers. This thesis studies how to design efficient structured sparse models and algorithms that identify biomolecule co-expression modules from cancer genomics data. It addresses the integrated analysis of cancer genomics data from several angles, each with its own emphasis. The main contributions and innovations of this thesis are as follows:

(1) For integrated analysis of cancer gene expression data and gene network structure information, we proposed a novel sparse Network-regularized Singular Value Decomposition (SVD) with Absolute operator (ANSVD) framework. The key of ANSVD is to impose a novel network-regularized penalty (|u|^T L |u|). We proposed a trick to remove the absolute operator and designed an efficient alternating iterative algorithm to solve the resulting problem. The results on two real cancer gene expression datasets showed that our method can discover more biologically interpretable expression patterns by integrating the prior gene interaction network.

(2) For integrated analysis of cancer gene expression data and gene group structure information, we proposed multiple group sparse SVD models and algorithms. We first proposed group-sparse SVD models with the group Lasso penalty (GL1-SVD) and the group L0-penalty (GL0-SVD). An alternating iterative algorithm based on a block coordinate descent method was proposed to solve GL1-SVD, and an alternating iterative algorithm based on a projection method was proposed to solve GL0-SVD. Moreover, we considered another class of group sparse SVD models with the Overlapping Group Lasso penalty (OGL1-SVD) and the Overlapping Group L0-penalty (OGL0-SVD). The key to solving OGL1-SVD is a proximal operator with the overlapping group Lasso penalty, which we solved with an alternating direction method of multipliers (ADMM). Similarly, the key to solving OGL0-SVD is a proximal operator with the overlapping group L0-penalty, for which we proposed an approximate method. Finally, we tested our methods by integrating cancer gene expression data with gene group structure derived from the KEGG pathways or from a gene interaction network. The results on multiple real cancer gene expression datasets showed that overlapping group sparse SVD can overcome the shortcomings of traditional sparse SVD methods and identify gene co-expression modules with better biological interpretations.

(3) For integrated analysis of multiple cancer gene expression datasets and gene network structure information, we proposed an Edge-group Sparse PCA (ESPCA) model. It enforces sparsity of the principal component loadings by considering the connectivity of gene variables in the prior network. We developed an alternating iterative algorithm to solve ESPCA. The key of this algorithm is to solve a new k-edge sparse projection problem, which is addressed with a greedy strategy. Here we adopted ESPCA to analyze multiple gene expression matrices simultaneously. By integrating gene network structure information, our method can overcome the drawbacks of sparse PCA and capture gene modules with better biological interpretations.

(4) For integrated analysis of multiple omics data of the same kind of cancer, we proposed two structured sparse methods. (a) We proposed a two-stage method. We first developed a multiple-output structured sparse regression model to predict a miRNA-gene association matrix. We then proposed an L0-regularized SVD (L0-SVD) to identify miRNA-gene joint modules from the predicted matrix. Finally, the two-stage method was tested on breast cancer data from the TCGA database and compared with related methods. (b) We proposed a novel Sparse Weighted Canonical Correlation Analysis (SWCCA). SWCCA can select not only the features of its input data matrices X and Y, but also the samples of X and Y. We applied L0-SWCCA to synthetic data and real-world data to demonstrate its effectiveness and superiority compared to related methods. Lastly, we also considered SWCCA with different penalties such as Lasso and Group Lasso, and extended it for integrating more than three omics data to identify multiple biomolecule co-expression modules.

(5) For integrated analysis of cancer gene expression data, gene network structure information and patient survival time data, we proposed a novel network-regularized sparse Logistic Regression model with an Absolute Network-regularized penalty (AbsNet.LR). The model can integrate the gene network structure information to predict the survival risk of cancer patients. Compared with traditional network-regularized sparse logistic regression models, AbsNet.LR can effectively overcome the influence of the Network-regularized penalty on the sign of the regression coefficient vector Qw. Finally, we tested our method on two biological datasets by integrating the gene expression data, the normalized Laplacian matrix L encoding the gene interaction network, and the clinical binary outcome for clinical risk prediction and biomarker discovery, and compared it with related methods.
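As an illustration of the alternating iterative scheme behind the L0-regularized SVD in contribution (4)(a), the following is a minimal sketch and not the thesis's implementation. It assumes a rank-one model in which the L0 penalty is expressed as cardinality constraints (at most `ku` nonzeros in u and `kv` in v) and each update is a hard-thresholding step followed by normalization. The function names `keep_top_k` and `l0_svd_rank1`, the parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def keep_top_k(z, k):
    """Keep the k largest-magnitude entries of z (k >= 1), zero the rest, then normalize."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]  # indices of the k largest |z_i|
    out[idx] = z[idx]
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out


def l0_svd_rank1(X, ku, kv, n_iter=100, tol=1e-8):
    """Assumed rank-one sparse SVD: X ~ d * u v^T with ||u||_0 <= ku, ||v||_0 <= kv."""
    # Initialize from the leading singular vectors of X.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    u = keep_top_k(U[:, 0], ku)
    v = keep_top_k(Vt[0], kv)
    for _ in range(n_iter):
        # Fix v and update u, then fix u and update v (alternating iteration).
        u_new = keep_top_k(X @ v, ku)
        v_new = keep_top_k(X.T @ u_new, kv)
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    d = u @ X @ v  # scale of the rank-one component
    return u, d, v


# Synthetic example (made-up data): a 5 x 4 module planted in weak noise.
rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((50, 40))
X[:5, :4] += 3.0
u, d, v = l0_svd_rank1(X, ku=5, kv=4)
print(np.flatnonzero(u), np.flatnonzero(v))
```

The nonzero entries of u and v jointly define one module (rows and columns of X). The group-sparse and network-regularized variants described above would replace the hard-thresholding step with a different projection or proximal operator, which this sketch does not attempt.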
Keywords/Search Tags: Machine learning, Optimization, Structured sparse learning, Coordinate descent, ADMM, Computational cancer genomics