Subgroup Analysis And Variable Selection For Biological Data Analysis

Posted on: 2023-07-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Cheng
GTID: 1520307028470294
Subject: Applied Statistics
Abstract/Summary:
With advances in DNA sequencing technology, bioinformatics has entered an era of exploding data, and researchers urgently need high-performance statistical methods to extract vital information from these enormous datasets. This dissertation focuses on two key problems in the analysis of complex genome-related sequencing data. The first part addresses robust subgroup analysis of heterogeneous cancer data. Cancer heterogeneity plays an important role in understanding tumor etiology, progression, and response to treatment. However, most existing studies share the limitation that they cannot accommodate heavy-tailed or contaminated outcomes together with high-dimensional covariates, both of which are common in biomedical research. To alleviate this problem, we propose a robust subgroup identification approach based on M-estimators together with concave and pairwise fusion penalties. The proposed method is data-driven: it requires no prior information about the latent subgroup structure. Sparsity-inducing penalties are applied to both the latent heterogeneity factors and the high-dimensional covariates, so that the estimation achieves subgroup identification and variable selection simultaneously. The core solver is built on an ADMM framework, which allows it to take advantage of modern multi-core computing resources. We implement the algorithm in parallel and demonstrate, in both simulations and real data analysis, its significant advantage on large-scale subgroup analysis problems. In contrast, a naive divide-and-conquer strategy fails because of insufficient communication among data batches. The convergence of the proposed algorithm, the oracle property of the penalized M-estimators, and the selection consistency of the proposed BIC criterion are carefully established. Many synthetic settings are
simulated to validate the theoretical results and to deepen our understanding of the proposed method in comparison with competing ones. The analysis of TCGA breast cancer mRNA expression data demonstrates that the proposed approach is promising for efficiently identifying underlying subgroups in high-dimensional data. We identify five potential subgroups whose clinical and genomic features, such as survival curves, tumor size, and copy number variations, differ significantly among groups.

In the second part of this dissertation, we propose a logistic model for high-dimensional functional compositional data to analyze the relationship between the gut microbiome and the colonization status of multi-drug resistant bacteria (MDRB) after liver transplantation. The gut microbiome has been shown to be closely related to human health. During a study, researchers often collect multiple samples for sequencing and microbiome identification, which naturally yields a set of trajectories describing this dynamic ecosystem. The proposed model builds on the linear log-contrast model for compositional data, with an extension: it incorporates both scalar and functional covariates for greater flexibility. A set of basis functions is chosen to give a low-rank approximation of both the functional covariates and their corresponding functional coefficients. In this way we achieve dimension reduction for the infinite-dimensional functions, and the functional variable selection problem becomes a group-wise variable selection problem. The resulting model takes the form of a logistic regression with grouped coefficients, restricted to an affine subspace. We develop an algorithm to solve this specific problem: first, we use the augmented Lagrangian method to incorporate the affine constraints into the objective function; then a local quadratic approximation is performed to transform the logistic regression locally into a weighted linear regression. Then we
repeat the MM step multiple times to further relax the objective and obtain a closed-form update in this iterative algorithm. The convergence of the proposed algorithm is established, and the statistical properties of the estimators are given for several penalized regression problems. The Lasso estimator is shown to be estimation-consistent but prone to over-selecting unrelated covariates, while the SCAD/MCP estimators are proved to have the oracle property. We also point out that the constraints in the objective function accurately describe the underlying model; hence the so-called constrained estimator performs on par with the unconstrained estimator, if not better. Various simulations provide numerical evidence. Finally, the proposed method is used to study the relationship between MDRB status and the gut microbiome of patients before and after liver transplantation. The analysis is conducted at different biological levels and with different variable selection approaches, and the selected variables are consistent across these levels, implying that the proposed method is promising for such studies.
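
As a rough illustration of why the affine constraint in the log-contrast model matters, the sketch below uses a simplified scalar (non-functional) logistic log-contrast model. It is not the dissertation's implementation: the function names, the coefficient values, and the example composition are all hypothetical, and the functional covariates, basis expansion, and penalties described above are omitted. It only shows that when the coefficients sum to zero, the predicted probability does not depend on the total scale of the composition.

```python
import math


def project_to_zero_sum(beta):
    """Project coefficients onto {beta : sum(beta) = 0} by removing the mean."""
    mean = sum(beta) / len(beta)
    return [b - mean for b in beta]


def log_contrast_probability(composition, beta, intercept=0.0):
    """P(y = 1) under a logistic log-contrast model (simplified, scalar case).

    Assumes every part of the composition is strictly positive and that
    the coefficients satisfy the zero-sum (affine) constraint.
    """
    if abs(sum(beta)) > 1e-9:
        raise ValueError("coefficients must sum to zero")
    eta = intercept + sum(b * math.log(x) for b, x in zip(beta, composition))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values, chosen only for illustration.
relative_abundance = [0.5, 0.3, 0.2]   # parts sum to 1
raw_counts = [50.0, 30.0, 20.0]        # same composition, different total
beta = project_to_zero_sum([1.0, -0.2, 0.4])

p_rel = log_contrast_probability(relative_abundance, beta)
p_cnt = log_contrast_probability(raw_counts, beta)
print(abs(p_rel - p_cnt) < 1e-9)  # scale invariance under the constraint
```

Scaling every part by a constant c adds log(c) times the sum of the coefficients to the linear predictor, which vanishes under the zero-sum constraint. This is one way to read the abstract's remark that the constraint describes the underlying model rather than being an extra restriction.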
Keywords: Regression models, Variable selection, M-estimation, Subgroup analysis, Functional data analysis, Compositional data