
Subgroup Analysis And Feature Selection Methods For Some Different Models

Posted on: 2021-03-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L L Liu    Full Text: PDF
GTID: 1367330602980905    Subject: Statistics
Abstract/Summary:
Recently, many researchers have paid much attention to identifying heterogeneous subgroups. Precision medicine is a common application of subgroup analysis, which seeks to give precise medical treatments to heterogeneous subgroups of patients. Owing to the diversity of patients in genes, environment, age, weight and so on, individualized treatment for different subgroups can achieve precise medical effects (see Ma and Huang, 2017). Another broad class of real-world applications is precision marketing. Heterogeneity of marketing strategies reflects the diversity of customers' consumption behaviours and preferences. Precision marketing offers personalized customer service and helps enterprises increase their profits by identifying the different marketing subgroups (see You et al., 2015). Thus, correctly identifying heterogeneous subgroups to enhance these effects is a significant issue.

In Chapter 2, we study subgroup identification in the heterogeneous additive partially linear model. To extend classical linear models to a more general setting, we are interested in the heterogeneous additive partially linear model for subgroup analysis. This model is flexible and widely used, combining parametric and nonparametric components; its structure allows an easy interpretation of the effect of each variable and avoids the curse of dimensionality. Moreover, our proposed framework is more generic, efficient, and adaptive in incorporating linearity, nonlinearity and heterogeneity. As an extension of the additive partially linear model, the heterogeneous additive partially linear model contains homogeneous linear components and subject-dependent additive components, but the group structure of the subject-dependent additive components is unknown. Such a model is flexible and efficient for addressing special problems such as precision medicine and precision marketing. The model is

    y_i = x_i^T β + g_i(z_i) + ε_i,  i = 1, ..., n,  (1.1)

where β = (β_1, ..., β_q)^T is the homogeneous coefficient vector and g_i(z_i) = g_{i0} + Σ_{j=1}^{p} g_{ij}(z_{ij}), with g_{i0} ∈ R being heterogeneous intercepts and g_{ij} (j ≥ 1) being unknown smooth functions. It is assumed that E[g_{ij}(z_{ij})] = 0 (i ≥ 1, j ≥ 1) for identification purposes.
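As a rough illustration of the structure of model (1.1) (a sketch added here for concreteness, not part of the original abstract), the following Python snippet simulates data with a shared linear component and two latent subgroup-specific additive functions; the sample size, coefficients and group functions are purely assumed.

```python
import numpy as np

# Minimal illustrative simulation from model (1.1): a homogeneous linear part
# x_i^T beta shared by all subjects, plus a subject-specific additive part
# g_i(z_i) that in truth takes one of two latent group-specific forms.
# All concrete choices below (n, beta, the two group functions) are assumptions
# made only for illustration.
rng = np.random.default_rng(0)

n, q = 200, 2                      # subjects, linear covariates
beta = np.array([1.0, -0.5])       # homogeneous coefficients (assumed)
x = rng.normal(size=(n, q))        # linear covariates
z = rng.uniform(-1, 1, size=n)     # one additive covariate (p = 1 for simplicity)

# Latent subgroup labels (unknown in practice) and two heterogeneous additive
# functions, each centered so that E[g(z)] = 0 under z ~ Uniform(-1, 1).
group = rng.integers(0, 2, size=n)
g0 = np.sin(np.pi * z)             # group 0 additive function
g1 = 2.0 * z**2 - 2.0 / 3.0        # group 1 additive function
g = np.where(group == 0, g0, g1)

y = x @ beta + g + rng.normal(scale=0.5, size=n)   # responses from model (1.1)
print(y[:5])
```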
Our goal is to identify, for each j = 0, 1, ..., p, the subgroups of subjects such that g_{ij} is the same function within each subgroup, and then to further estimate the subgroup-specific additive functions and the homogeneous parameter β. Polynomial spline smoothing is used to approximate the heterogeneous additive components, leading to the working model

    y_i = x_i^T β + B(z_i)^T γ_i + ε_i,  i = 1, ..., n,  (1.2)

where γ_i = (γ_{i0}, γ_{i1}^T, ..., γ_{ip}^T)^T ∈ R^{N_n p + 1} is the individual-specific spline coefficient vector. A challenging problem is how to identify the subgroups when the number K of subgroups is unknown in advance. For linear models with unknown K, Ma and Huang (2017) used a concave pairwise fusion penalty approach to identify heterogeneous subgroups. When n and p are large, however, this method is complicated and unstable, since its implementation requires iteratively storing and manipulating the entire np-dimensional parameter vector, and the memory and computational cost can be extremely high. We convert the optimization over slope coefficients into an optimization over intercept coefficients, and a new clustering method is developed to automatically identify subgroups. The procedure avoids solving for the coefficient vector in each iterative step, as regression clustering procedures do, so this approach is fast and computationally stable even when the sample size is large. Based on the clustered heterogeneous additive components, consistent estimators of the homogeneous parameters and the subgroup-specific additive components are further obtained. Moreover, √n-consistency and asymptotic normality of the estimators of the parametric components are established.

In Chapter 3, we study capturing heterogeneity in repeated measures data by a fusion penalty. Longitudinal or clustered data are commonly encountered in biomedical studies. For example, biomarkers are measured over time in longitudinal studies, and the repeated measures of a biomarker on the same subject tend to be correlated. In clustered studies, health outcomes of subjects within the same cluster (e.g., twins, families, or communities) are more alike due to shared genetic and/or environmental characteristics. For ease of illustration, we use the term "repeated measures" in a general sense to denote either measures from multiple units within a cluster (repeated over space, e.g., the left and right eyes of the same person) or measures of the same marker across time (repeated over time, e.g., longitudinal measures of blood pressure of the same subject). The correlation of repeated measures from the same subject or cluster needs to be accounted for to yield more accurate and efficient estimates.

Traditionally, such heterogeneity is modeled by either fixed effects or random effects. In fixed effects models, the number of degrees of freedom for the heterogeneity equals the number of clusters/subjects minus 1, which can result in a loss of efficiency. In random effects models, the heterogeneity across clusters/subjects is described by, e.g., a random intercept with one parameter (the variance of the random intercept), which can lead to oversimplification and biases (shrinkage estimates). Our "fusion effects" model stands in between these two approaches: we assume that there is an unknown number of different levels of heterogeneity, and we use the fusion penalty approach for estimation and inference. To achieve an appropriate balance between accuracy and efficiency, we propose a new approach in between the fixed effects and random effects models. In our model, we assume that the heterogeneity of each subject belongs to one of several groups. By penalizing the fusion effect (the difference between two subject-specific effects), we automatically group the subject-specific effects without knowing the group membership of the subjects in advance; we therefore term our method the "fusion effects" model. Our model is along the lines of Ma and Huang, adapting their method to repeated measures data. Computationally, we propose an alternating direction method of multipliers (ADMM) algorithm to implement the estimation procedure; ADMM has been used for solving a large class of convex optimization problems. We use concave penalties on the pairwise differences of the parameters. Such penalties include the smoothly clipped absolute deviation penalty (SCAD) and the minimax concave penalty (MCP), which enjoy the consistency property.
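To give a concrete sense of the kind of computation a concave pairwise fusion penalty with ADMM involves, here is a minimal sketch (not the dissertation's implementation): it fuses subject-specific means in a simple repeated-measures model y_ij = μ_i + ε_ij using an MCP penalty on all pairwise differences. The penalty parameters λ and γ, the ADMM step size ρ, the iteration count and the toy data are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def mcp_threshold(w, lam, gamma, rho):
    """Minimizer of (rho/2)(t - w)^2 + MCP(t; lam, gamma), assuming gamma > 1/rho."""
    soft = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0.0)
    shrunk = soft / (1.0 - 1.0 / (gamma * rho))
    return np.where(np.abs(w) <= gamma * lam, shrunk, w)

def fuse_intercepts(y_by_subject, lam=0.5, gamma=3.0, rho=1.0, n_iter=200):
    """ADMM sketch: fuse subject-specific means via an MCP penalty on pairwise differences."""
    n = len(y_by_subject)
    m = np.array([len(y) for y in y_by_subject], dtype=float)  # measures per subject
    y_sum = np.array([np.sum(y) for y in y_by_subject])
    pairs = list(combinations(range(n), 2))
    D = np.zeros((len(pairs), n))
    for r, (i, k) in enumerate(pairs):                         # D @ mu = all pairwise differences
        D[r, i], D[r, k] = 1.0, -1.0
    mu = y_sum / m                                             # start at the subject means
    delta = D @ mu
    u = np.zeros(len(pairs))                                   # scaled dual variable
    A = np.diag(m) + rho * D.T @ D                             # fixed matrix in the mu-update
    for _ in range(n_iter):
        mu = np.linalg.solve(A, y_sum + rho * D.T @ (delta - u))   # quadratic mu-update
        delta = mcp_threshold(D @ mu + u, lam, gamma, rho)         # concave-penalty update
        u += D @ mu - delta                                        # dual ascent step
    return mu

# Toy usage: two latent groups of subjects with means 0 and 2 (assumed for illustration).
rng = np.random.default_rng(1)
data = [rng.normal(loc=0.0, scale=0.3, size=5) for _ in range(10)] + \
       [rng.normal(loc=2.0, scale=0.3, size=5) for _ in range(10)]
print(np.round(fuse_intercepts(data), 2))
```

In practice λ would be tuned over a grid (e.g., with a BIC-type criterion), and the fused values of μ define the estimated subgroups.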
In Chapter 4, we study variable selection in the high-dimensional multi-response interaction model. Interaction feature screening for high-dimensional data is still a challenging issue, especially when the variables are ultrahigh-dimensional and strongly correlated. For the ultrahigh-dimensional multi-response interaction model, we propose a projection onto a conditional set to screen the main-effect and interaction variables when predictors are highly correlated. We take previously selected main-effect or interaction variables as the conditional information and project the covariates and the response vector onto the conditional set to select the active variables. By projecting the covariates, the proposed procedure can substantially reduce the confounding effects of the previously selected main-effect or interaction variables, and it handles the problems of missing hidden important variables and mistakenly choosing unimportant variables. It is worth noting that the size of the conditional set cannot be too large, because a large conditional set would distort the connection between predictors and responses and be computationally expensive. To this end, we impose a threshold on the maximal cardinality of the conditional set to decide which variables are most related. Based on interaction pursuit via distance correlation (IPDC, Kong et al., 2016), we consider screening active interaction variables via partial distance correlation for the ultrahigh-dimensional multi-response interaction model. Compared to identifying important interactions directly, screening active interaction variables is computationally more efficient, reducing the computational cost from a factor of O(p^2) to O(p). Moreover, because it inherits the advantages of IPDC, the new method does not require the weak or strong heredity assumption. After the screening step, we construct pairwise interactions from the retained interaction variables and use regularization methods to identify important main effects and interactions. From the theoretical properties and simulation studies, we see that our screening method performs well in the screening step and possesses the sure screening property: important interaction variables and main effects are retained in the model with probability tending to one.
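As a simplified illustration of the screening building block only (not the chapter's conditional-projection or partial-distance-correlation procedure), the sketch below ranks predictors by the distance correlation between each squared predictor and the squared response, in the spirit of IPDC, and keeps a fixed number of top-ranked variables. The function names, the toy data and the cutoff are assumptions made for illustration.

```python
import numpy as np

def _centered_dist(a):
    """Doubly centered Euclidean distance matrix of a sample (rows = observations)."""
    a = a.reshape(len(a), -1)
    d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(-1))
    return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation (Szekely-Rizzo style V-statistic)."""
    A, B = _centered_dist(x), _centered_dist(y)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max(dcov2, 0.0) / denom)

def screen_interaction_variables(X, Y, top_d=10):
    """Rank predictors by dcor(X_j^2, Y^2) and keep the top_d, in the spirit of IPDC.
    Plain (unconditional) distance correlation only; the conditional projection step
    described in the chapter is not implemented here."""
    Y2 = Y ** 2
    scores = np.array([distance_correlation(X[:, j] ** 2, Y2) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_d], scores

# Toy usage with an assumed interaction X_0 * X_1 driving the response.
rng = np.random.default_rng(2)
n, p = 150, 50
X = rng.normal(size=(n, p))
Y = 2.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=n)
kept, scores = screen_interaction_variables(X, Y, top_d=5)
print(kept)
```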
Keywords/Search Tags: Subgroup, B-spline, heterogeneity, clustering, estimation consistency, variable selection, high dimensional data, precision medicine, fusion penalty, interaction screening, projecting, sure screening