A Copula Model-based Method For Regression Analysis Of Dependent Current Status Data

Posted on: 2019-01-05; Degree: Doctor; Type: Dissertation
Country: China; Candidate: Q Cui
GTID: 1360330572452962; Subject: Probability theory and mathematical statistics
Abstract/Summary:
In recent years, regression analysis of interval-censored data has attracted great attention. By interval-censored data, we mean that the failure time of interest cannot be observed exactly but is only known to lie in a time interval (Sun, 2006; Chen et al., 2012). This kind of data occurs frequently in many fields, such as clinical trials, demographic studies, sociology, and tumorigenicity experiments. In general, interval-censored data include Case I, Case II, and Case K interval-censored data. By Case I interval-censored data, also referred to as current status data, we mean that each subject is observed only once and the failure time of interest is known only to occur before or after the observation time. In other words, the failure time of interest is either left-censored or right-censored. The observed data have the form $\{C, \delta = I(T \le C)\}$. By Case II interval-censored data, we mean that we only know the position of the failure time $T$ relative to the time interval $(U, V)$: it is no larger than $U$, inside the interval, or larger than $V$. The observed data have the form $\{U, V, \delta_1 = I(T \le U), \delta_2 = I(U < T \le V), \delta_3 = 1 - \delta_1 - \delta_2\}$.

Many authors have discussed regression analysis of current status data. For example, Huang (1996) proposed the maximum likelihood estimator for the proportional hazards model and proved its asymptotic properties. Rossini and Tsiatis (1996) discussed an estimation procedure for the proportional odds model. Lin et al. (1998) considered regression analysis under the additive hazards model using counting processes. Moreover, Martinussen and Scheike (2002), Chen et al. (2009), and Wen and Chen (2011) also considered regression analysis of current status data. However, all the methods mentioned above assume that the failure time is independent of the observation or censoring time given the covariates, and this independence assumption may not be valid in many situations. There may exist some dependence between the failure time and the censoring time, in which case the data are often referred to as dependent or informative current status data.

Tumorigenicity experiments in animals often produce dependent current status data, in which the failure time of interest is usually the time to tumor onset. One usually observes only current status data because the presence or absence of a tumor can be determined only when the animal dies naturally or is sacrificed. As most tumors are between lethal and non-lethal, the tumor onset time and the death time can be related. There exists a great deal of literature on the statistical analysis of tumor onset time data, but most existing methods are based on parametric models, for example, Dinse and Lagakos (1983), Lagakos and Louis (1988), and Rai et al. (2002). Recently, some authors have discussed semiparametric regression analysis of dependent Case I interval-censored data. For example, Ma et al. (2015), Zhao et al. (2015) and Xu et al. (2019) proposed copula-based procedures to describe the relationship between the failure time and the observation time, but both the copula model and the association parameter need to be
known. However, as mentioned in Ma et al. (2015), when the association parameter $\alpha$ is misspecified, the estimators of the regression parameter $\beta$ can be biased.

The copula model-based approach plays an important role in modeling correlated random variables. For example, Shih and Louis (1995) developed estimation procedures for the association parameter based on bivariate right-censored data. Huang and Zhang (2008) conducted sensitivity analysis for dependent right-censored data under the proportional hazards model. Wang et al. (2008) used a copula model to estimate the regression parameter and the association parameter for bivariate current status data. In this paper, like the authors mentioned above, we consider regression analysis of dependent current status data under the assumption that the copula model is known, but we allow the association parameter $\alpha$ to be unknown. We propose a two-step estimation procedure for the regression parameters and the association parameter $\alpha$.

Regression analysis is a set of statistical processes for estimating the effect of covariates on the failure time variable. Many regression models have been proposed by statisticians. Among them, semiparametric models have both parametric and nonparametric components and are preferred by researchers because they balance interpretability and flexibility. Commonly used semiparametric models in survival analysis include the proportional hazards model, the additive hazards model, and linear transformation models. The proposed two-step estimation procedure works well under these commonly used semiparametric models.

Consider a failure time study that involves $n$ independent subjects. For subject $i$, let $T_i$ denote the failure time of interest and $Z_i$ a $p$-dimensional vector of covariates, and suppose that there exist two potential observation or censoring times denoted by $C_i$ and $\xi_i$, $i = 1, \dots, n$. Here we assume that $C_i$ may be related to $T_i$ but $\xi_i$, such as an administrative stopping time, is independent of $T_i$. In the tumor example, $C_i$ denotes the natural death time and $\xi_i$ represents the sacrifice or study stopping time. We can only observe $\tilde{C}_i = \min(C_i, \xi_i)$, $\Delta_i = I(C_i \le \xi_i)$ and $\delta_i = I(T_i \le \tilde{C}_i)$. Then the observed data have the form $\{X_i = (\delta_i, \Delta_i, \tilde{C}_i, Z_i), i = 1, \dots, n\}$.

To describe the effects of covariates, we assume that, given the covariates $Z_i$, $T_i$ and $C_i$ follow the marginal proportional hazards models (Ma et al., 2015)

$$\lambda^{(T)}(t \mid Z_i) = \lambda_1(t) \exp(Z_i^T \beta_c), \quad (1)$$

$$\lambda^{(C)}(t \mid Z_i) = \lambda_2(t) \exp(Z_i^T \gamma), \quad (2)$$

where $\lambda_1(t)$ and $\lambda_2(t)$ are unknown baseline hazard functions, and $\beta_c$ and $\gamma$ are $p$-dimensional vectors of regression parameters. Let $F_T$ and $F_C$ denote the marginal distributions of the $T_i$'s and the $C_i$'s given covariates, respectively, and $F$ the joint distribution of $T_i$ and $C_i$. According to Sklar's theorem (Nelsen, 2006), there exists a copula function $C_\alpha(u, v)$ defined on $I^2 = [0,1] \times [0,1]$ satisfying

$$F(t, c) = C_\alpha\{F_T(t), F_C(c)\}. \quad (3)$$

If $F_T$ and $F_C$ are continuous, $C_\alpha$ is unique. Conversely, if $C_\alpha$ is a copula and $F_T$ and $F_C$ are distribution functions, then the function $F(t, c)$ defined by (3) is a joint distribution function with margins $F_T$ and $F_C$. Here, $\alpha$ is referred to as the association parameter, representing the relationship between $T_i$ and $C_i$.

Define $\Lambda_1(t) = \int_0^t \lambda_1(s)\,ds$ and $\Lambda_2(t) = \int_0^t \lambda_2(s)\,ds$, and let $f_C$ denote the marginal density function of the $C_i$'s given covariates. Then we have $F_T(t) = 1 - \exp\{-\Lambda_1(t) e^{Z_i^T \beta_c}\}$, $F_C(t) = 1 - \exp\{-\Lambda_2(t) e^{Z_i^T \gamma}\}$, and $f_C(t) = \exp\{-\Lambda_2(t) e^{Z_i^T \gamma}\} \lambda_2(t) \exp(Z_i^T \gamma)$. Note that the conditional distribution of $T_i$ given $C_i$ and $Z_i$ is

$$P(T \le t \mid C = c, Z_i) = \left.\frac{\partial C_\alpha(u, v)}{\partial v}\right|_{u = F_T(t),\, v = F_C(c)} = m_\alpha\{F_T(t), F_C(c)\}.$$

Furthermore, the likelihood function can be written as

$$L(\theta, \eta) = \prod_{i=1}^n \{f_C(\tilde{C}_i)\}^{\Delta_i} [m_\alpha\{F_T(\tilde{C}_i), F_C(\tilde{C}_i)\}]^{\delta_i \Delta_i} [1 - m_\alpha\{F_T(\tilde{C}_i), F_C(\tilde{C}_i)\}]^{(1-\delta_i)\Delta_i} [F_T(\tilde{C}_i) - C_\alpha\{F_T(\tilde{C}_i), F_C(\tilde{C}_i)\}]^{\delta_i(1-\Delta_i)} [1 - F_T(\tilde{C}_i) - F_C(\tilde{C}_i) + C_\alpha\{F_T(\tilde{C}_i), F_C(\tilde{C}_i)\}]^{(1-\delta_i)(1-\Delta_i)},$$

where $\theta = \{\beta_c^T, \alpha, \Lambda_1(\cdot)\}^T$ and $\eta = \{\gamma^T, \Lambda_2(\cdot)\}^T$.

Now we discuss estimation and inference for models (1) and (2), with the focus on the regression parameter $\beta_c$. For this, we present a two-step sieve estimation procedure that first estimates model (2) and then model (1). More specifically, for the first step, note that for the observation or censoring times $C_i$ we have complete or right-censored data, and thus it is natural to estimate $\gamma$
and $\Lambda_2$ by the maximum partial likelihood estimator and the Breslow estimator, respectively (Kalbfleisch and Prentice, 2002). Let $\hat\gamma$ denote the maximum partial likelihood estimator of $\gamma$ and $\hat\Lambda_2$ the Breslow estimator of $\Lambda_2$, and write $\hat\eta = (\hat\gamma, \hat\Lambda_2)$. Let $N_i(t) = I(\tilde{C}_i \le t, \Delta_i = 1)$ and $Y_i(t) = I(\tilde{C}_i \ge t)$. Given $\hat\gamma$, it is common to estimate $\Lambda_2$ by

$$\hat\Lambda_2(t) = \sum_{i=1}^n \int_0^t \frac{dN_i(s)}{\sum_{j=1}^n Y_j(s) \exp(Z_j^T \hat\gamma)},$$

which is usually referred to as the Breslow estimator (Breslow, 1972; Andersen, 1982). Then one can estimate the marginal distribution of the $C_i$'s by $\hat{F}_C(t) = 1 - \exp\{-\hat\Lambda_2(t) \exp(Z_i^T \hat\gamma)\}$.

Given $\hat\eta = (\hat\gamma, \hat\Lambda_2)$, for the second step, to estimate $\theta$ one could maximize the conditional likelihood function $L(\theta \mid \hat\eta)$. However, this maximization can be difficult because $\Lambda_1(\cdot)$ is infinite-dimensional. To address this, following Huang and Rossini (1997) and others, we propose to approximate $\Lambda_1(\cdot)$ with monotone I-splines before the maximization (Lu et al., 2007; Ramsay, 1988). More specifically, let $M$ denote a positive constant and $\{I_j(t)\}_{j=1}^{m+k_n}$ the I-spline basis functions of degree $m$ with $k_n$ interior knots, where $k_n = o(n^\nu)$ with $0 < \nu < 0.5$. Define $\Theta_n = \{\theta_n = (\beta_c^T, \alpha, \Lambda_{1n})^T\} = \mathcal{B} \otimes \mathcal{M}_n$, where $\mathcal{B} = \{(\beta_c^T, \alpha)^T \in R^{p+1}, \|\beta_c\| + \|\alpha\| < M\}$ with $\|v\|$ denoting the Euclidean norm of a vector $v$, and

$$\mathcal{M}_n = \Big\{\Lambda_{1n}: \Lambda_{1n}(t) = \sum_{j=1}^{m+k_n} \zeta_j I_j(t),\ \zeta_j \ge 0,\ j = 1, \dots, m + k_n,\ t \in [0, u_c]\Big\},$$

with $u_c$ being the upper bound of all observation times $\{\tilde{C}_i: i = 1, \dots, n\}$. It follows from Lemma A1 of Lu et al. (2007) that $\Theta_n$ can be used as a sieve space of the original parameter space $\Theta$. Then we can estimate $\theta$ by the sieve maximum likelihood estimator, denoted by $(\hat\beta_c, \hat\alpha, \hat\Lambda_{1n}(\cdot))$ and defined as the value of $\theta$ that maximizes the conditional log-likelihood function $l(\theta \mid \hat\eta) = \sum_{i=1}^n l^{(i)}(\theta \mid \hat\eta)$ over $\Theta_n$, where

$$l^{(i)}(\theta \mid \hat\eta) = \Delta_i \log \hat{f}_C(\tilde{C}_i) + (1-\delta_i)\Delta_i \log[1 - m_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + \delta_i \Delta_i \log[m_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + \delta_i(1-\Delta_i) \log[F_T(\tilde{C}_i) - C_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + (1-\delta_i)(1-\Delta_i) \log[1 - F_T(\tilde{C}_i) - \hat{F}_C(\tilde{C}_i) + C_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}].$$

As mentioned above, a main advantage of the proposed estimation procedure over that given in Ma et al. (2015) is that it does not require the association parameter $\alpha$ to be known. Also as discussed above, in general the copula model and the association parameter cannot be estimated without extra information. Here, the extra information is given by the estimate of the marginal distribution $F_C$ from the first step, which can then be treated as known. The estimators $\hat\gamma$ and $\hat\Lambda_2$ have been studied by many authors and, in particular, they are consistent (Kalbfleisch and Prentice, 2002). Next we establish the asymptotic properties of $\hat\beta_c$.

Theorem 1. Assume that the regularity conditions (B1), (B2) and (C1)-(C4) in Chapter 2 hold. Then as $n \to \infty$, $\hat\beta_c$ is consistent and $\sqrt{n}(\hat\beta_c - \beta_0)$ converges to a multivariate normal distribution with mean zero, where $\beta_0$ denotes the true value of $\beta_c$.

In the previous section, we assumed that the failure time follows the proportional hazards model, under which covariates act multiplicatively on the hazard rate. In some applications, however, this assumption is not valid, or we are interested in other types of covariate effects. The additive hazards model is a good choice when the proportional hazards model does not fit the data well (Lin et al., 1998; Kulich and Lin, 2000). The additive hazards model assumes that, given the covariates $Z_i$, the hazard function of $T_i$ has the form

$$\lambda^{(T)}(t \mid Z_i) = \lambda_0(t) + Z_i^T \beta_a, \quad (4)$$

where $\lambda_0(t)$ denotes an unknown baseline hazard function, and $\beta_a$ is a $p$-dimensional vector of regression parameters. The covariates have an additive effect, which describes the risk difference rather than the risk ratio. It is attractive because of its simple form and simple inference
procedure. In this section, we assume that the failure time $T_i$ follows the additive hazards model (4) and the censoring time $C_i$ still follows the proportional hazards model (2), and consider regression analysis when $T_i$ and $C_i$ are dependent. Given covariates, the marginal distribution function of $T_i$ is $F_T(t) = 1 - \exp\{-\Lambda_0(t) - Z_i^T \beta_a t\}$, where $\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds$, and the marginal distribution function of $C_i$ is the same as in the previous section.

The two-step estimation procedure proposed in the previous section can still be used to estimate the regression parameter $\beta_a$. For the first step, we have complete or right-censored data for the observation or censoring times $C_i$, and thus it is natural to estimate $\gamma$ and $\Lambda_2$ by the maximum partial likelihood estimator and the Breslow estimator, respectively (Kalbfleisch and Prentice, 2002). Let $\hat\gamma$ denote the maximum partial likelihood estimator of $\gamma$ and $\hat\Lambda_2$ the Breslow estimator of $\Lambda_2$, and write $\hat\eta = (\hat\gamma, \hat\Lambda_2)$. After estimating the marginal distribution of the $C_i$'s by $\hat{F}_C(t) = 1 - \exp\{-\hat\Lambda_2(t) \exp(Z_i^T \hat\gamma)\}$, for the second step, the estimation of $\beta_a$ and $\alpha$ given $\hat\eta$, it is natural to maximize the conditional likelihood function $L(\theta \mid \hat\eta)$. The maximization proceeds as in the previous section: we employ the sieve approach and approximate $\Lambda_0(\cdot)$ with I-spline functions before the maximization (Ramsay, 1988).

More specifically, for a finite closed interval $[a, b]$, let $\mathcal{X} = \{x_i\}_{i=1}^{k_n + 2m}$ with $a = x_1 = \dots = x_m < x_{m+1} < \dots < x_{k_n+m} < x_{k_n+m+1} = \dots = x_{k_n+2m} = b$ be a sequence of knots that partition $[a, b]$. Given $\mathcal{X}$, the class of I-spline basis functions $\{I_i(t \mid m, x)\}_{i=1}^{m+k_n}$ of degree $m$ with $k_n$ interior knots is defined as $I_i(t \mid m, x) = \int_a^t M_i(u \mid m, x)\,du$, where the M-spline functions $\{M_i(t \mid m, x)\}_{i=1}^{m+k_n}$ are defined recursively (Ramsay, 1988) as

$$M_i(t \mid 1, x) = \frac{1}{x_{i+1} - x_i} \ \text{for } x_i \le t < x_{i+1}, \ \text{and } 0 \text{ otherwise},$$

and

$$M_i(t \mid m, x) = \frac{m[(t - x_i) M_i(t \mid m-1, x) + (x_{i+m} - t) M_{i+1}(t \mid m-1, x)]}{(m-1)(x_{i+m} - x_i)}$$

for $m > 1$. It is easy to see that the I-spline functions are monotone nondecreasing basis functions, so we can approximate the unknown cumulative baseline hazard function $\Lambda_0(t)$ by $\Lambda_{0n}(t) = \sum_{i=1}^{m+k_n} \zeta_i I_i(t)$, where $\zeta_i \ge 0$, $i = 1, \dots, m + k_n$.

Let $M$ denote a positive constant and $\{I_j(t)\}_{j=1}^{m+k_n}$ the I-spline basis functions of degree $m$ with $k_n$ interior knots, where $k_n = o(n^\nu)$ with $0 < \nu < 0.5$. Define $\Theta_n = \{\theta_n = (\beta_a^T, \alpha, \Lambda_{0n})^T\} = \mathcal{B} \otimes \mathcal{M}_n$, where $\mathcal{B} = \{(\beta_a^T, \alpha)^T \in R^{p+1}, \|\beta_a\| + \|\alpha\| < M\}$ with $\|v\|$ denoting the Euclidean norm of a vector $v$, and

$$\mathcal{M}_n = \Big\{\Lambda_{0n}: \Lambda_{0n}(t) = \sum_{j=1}^{m+k_n} \zeta_j I_j(t),\ \zeta_j \ge 0,\ j = 1, \dots, m + k_n,\ t \in [0, u_c]\Big\},$$

with $u_c$ being the upper bound of all observation times $\{\tilde{C}_i: i = 1, \dots, n\}$. It follows from Lemma A1 of Lu et al. (2007) that $\Theta_n$ can be used as a sieve space of the original parameter space $\Theta$. In this way, an estimation problem involving both finite-dimensional and infinite-dimensional parameters is converted into a simpler problem that involves only finite-dimensional parameters. Then we can estimate $\theta$ by the sieve maximum likelihood estimator, denoted by $(\hat\beta_a, \hat\alpha, \hat\Lambda_{0n}(\cdot))$ and defined as the value of $\theta$ that maximizes the conditional log-likelihood function $l(\theta \mid \hat\eta)$ over $\Theta_n$, where $l(\theta \mid \hat\eta) = \sum_{i=1}^n l^{(i)}(\theta \mid \hat\eta)$ and

$$l^{(i)}(\theta \mid \hat\eta) = \Delta_i \log \hat{f}_C(\tilde{C}_i) + (1-\delta_i)\Delta_i \log[1 - m_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + \delta_i \Delta_i \log[m_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + \delta_i(1-\Delta_i) \log[F_T(\tilde{C}_i) - C_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + (1-\delta_i)(1-\Delta_i) \log[1 - F_T(\tilde{C}_i) - \hat{F}_C(\tilde{C}_i) + C_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}].$$

Similarly, $\hat\beta_a$ is consistent and asymptotically normal.

Theorem 2. Assume that the regularity conditions (B1), (B2) and (C1)-(C4) hold. Then as $n \to \infty$, $\hat\beta_a$ is consistent and $\sqrt{n}(\hat\beta_a - \beta_0)$ converges to a multivariate normal distribution with mean zero, where $\beta_0$ denotes the true value of $\beta_a$.

Finally, we assume that the failure time follows the linear transformation models, which provide a class of flexible models. Linear transformation models assume that, given the covariates $Z_i$, the failure time $T_i$ satisfies (Chen et al., 2002; Ma and Kosorok, 2005)

$$h(T) = -\beta_l^T Z + \epsilon, \quad (5)$$

where $h(\cdot)$ is an unspecified strictly increasing function, $\beta_l$ is an unknown parameter, and the distribution function $F_\epsilon$ of the error term $\epsilon$
is known. Equivalently, model (5) can be represented by

$$S(t \mid Z) = g(h(t) + \beta_l^T Z), \quad (6)$$

where $S(t \mid Z)$ denotes the survival function of $T$ given $Z$, and $g$ is a known continuous and strictly decreasing function. The main advantage of linear transformation models is their flexibility, since they include many well-known regression models as special cases. For example, taking $F_\epsilon$ to be the extreme value distribution, that is, setting $g(t) = \exp(-\exp(t))$, gives the proportional hazards model, and taking $\epsilon$ to follow the standard logistic distribution, that is, $g(t) = 1/(1 + \exp(t))$, gives the proportional odds model.

In this section, we assume that the failure time $T_i$ follows the linear transformation models (5) and the censoring time $C_i$ still follows the proportional hazards model (2), and consider regression analysis when $T_i$ and $C_i$ are dependent. Given covariates, the marginal distribution function of $T_i$ is $F_T(t) = 1 - g\{\log H(t) + \beta_l^T Z\}$, where $H(t) = \exp(h(t))$, and the marginal distribution function of $C_i$ is the same as in the previous section.

The two-step estimation procedure can still be used to estimate the regression parameter $\beta_l$. For the first step, we have complete or right-censored data for the observation or censoring times $C_i$, and thus it is natural to estimate $\gamma$ and $\Lambda_2$ by the maximum partial likelihood estimator and the Breslow estimator, respectively (Kalbfleisch and Prentice, 2002). Let $\hat\gamma$ denote the maximum partial likelihood estimator of $\gamma$ and $\hat\Lambda_2$ the Breslow estimator of $\Lambda_2$, and write $\hat\eta = (\hat\gamma, \hat\Lambda_2)$. After estimating the marginal distribution of the $C_i$'s by $\hat{F}_C(t) = 1 - \exp\{-\hat\Lambda_2(t) \exp(Z_i^T \hat\gamma)\}$, for the second step, the estimation of $\beta_l$ and $\alpha$ given $\hat\eta$, it is natural to maximize the conditional likelihood function $L(\theta \mid \hat\eta)$. Before the maximization, we propose to approximate the unknown nonnegative increasing function $H(t)$ by I-splines, $H_n(t) = \sum_{i=1}^{m+k_n} \zeta_i I_i(t)$, where $\zeta_i \ge 0$, $i = 1, \dots, m + k_n$. We can estimate $\theta$ by the sieve maximum likelihood estimator, denoted by $(\hat\beta_l, \hat\alpha, \hat{H}_n(\cdot))$ and defined as the value of $\theta$ that maximizes the conditional log-likelihood function $l(\theta \mid \hat\eta)$ over $\Theta_n$, where $l(\theta \mid \hat\eta) = \sum_{i=1}^n l^{(i)}(\theta \mid \hat\eta)$ and

$$l^{(i)}(\theta \mid \hat\eta) = \Delta_i \log \hat{f}_C(\tilde{C}_i) + (1-\delta_i)\Delta_i \log[1 - m_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + \delta_i \Delta_i \log[m_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + \delta_i(1-\Delta_i) \log[F_T(\tilde{C}_i) - C_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}] + (1-\delta_i)(1-\Delta_i) \log[1 - F_T(\tilde{C}_i) - \hat{F}_C(\tilde{C}_i) + C_\alpha\{F_T(\tilde{C}_i), \hat{F}_C(\tilde{C}_i)\}].$$

Similarly, $\hat\beta_l$ is consistent and asymptotically normal.

Theorem 3. Assume that the regularity conditions (B1), (B2) and (C1)-(C4) hold. Then as $n \to \infty$, $\hat\beta_l$ is consistent and $\sqrt{n}(\hat\beta_l - \beta_0)$ converges to a multivariate normal distribution with mean zero, where $\beta_0$ denotes the true value of $\beta_l$.

To make inference about the regression parameter, we need to estimate the covariance matrix of $\hat\beta$. One natural way would be to derive a consistent estimator, but as can be seen in the proof of the theorem, such an estimator would be too complicated to be useful. Instead, we suggest the following bootstrap procedure (Efron, 1979). Let $B$ denote a prespecified positive integer and, for each $b = 1, \dots, B$, draw a simple random sample $\{X_i^{(b)} = (\delta_i^{(b)}, \Delta_i^{(b)}, \tilde{C}_i^{(b)}, Z_i^{(b)}), i = 1, \dots, n\}$ of size $n$ with replacement from the observed data $\{X_i, i = 1, \dots, n\}$. Let $\hat\beta^{(b)}$ denote the sieve maximum likelihood estimator of $\beta$ defined above based on the resampled data set $\{X_i^{(b)}, i = 1, \dots, n\}$, $b = 1, \dots, B$. Then one can estimate the covariance matrix of $\hat\beta$ by

$$\widehat{\mathrm{Var}}(\hat\beta) = \frac{1}{B-1} \sum_{b=1}^B \Big(\hat\beta^{(b)} - \frac{1}{B} \sum_{b'=1}^B \hat\beta^{(b')}\Big)^{\otimes 2},$$

where $v^{\otimes 2} = v v^T$ for a vector $v$. Similarly, one can show that $\hat\alpha$ is consistent and asymptotically normal, and estimate its variance by the same approach.

For the regression analysis of dependent Case I interval-censored data, we proposed a two-step estimation procedure that is flexible and easy to implement. The proposed estimation procedure works well when the failure time follows any of several commonly used semiparametric models.
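As a rough illustration of the per-subject conditional log-likelihood and the bootstrap covariance estimator described above, the following Python sketch may help. It rests on assumptions that are not in the abstract: the copula is taken to be a Clayton copula purely as an example (the abstract only requires that the copula family be known), the marginal values `FT`, `FC`, `fC` are passed in as plain numbers rather than built from I-splines and the Breslow estimator, and all function names (`clayton_cdf`, `clayton_m`, `loglik_i`, `bootstrap_cov`) are hypothetical.

```python
import numpy as np


def clayton_cdf(u, v, alpha):
    """Clayton copula C_alpha(u, v) for alpha > 0 (an assumed example family)."""
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def clayton_m(u, v, alpha):
    """m_alpha(u, v) = dC_alpha(u, v)/dv, i.e. P(T <= t | C = c) on the copula scale."""
    base = u ** -alpha + v ** -alpha - 1.0
    return v ** (-alpha - 1.0) * base ** (-1.0 / alpha - 1.0)


def loglik_i(delta, Delta, FT, FC, fC, alpha):
    """Conditional log-likelihood contribution of one subject.

    delta: 1 if the failure occurred by the observation time, else 0.
    Delta: 1 if the dependent censoring time C was observed, else 0.
    FT, FC, fC: marginal values evaluated at the observation time.
    """
    c = clayton_cdf(FT, FC, alpha)
    m = clayton_m(FT, FC, alpha)
    if Delta == 1:
        # C observed exactly: density of C times P(T <= c | C = c) or its complement.
        return np.log(fC) + (np.log(m) if delta == 1 else np.log1p(-m))
    # C exceeds the administrative time: joint probabilities with C > observation time.
    if delta == 1:
        return np.log(FT - c)
    return np.log(1.0 - FT - FC + c)


def bootstrap_cov(data, estimator, B=200, seed=0):
    """Bootstrap covariance: resample rows with replacement, re-estimate B times."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = np.array(
        [np.atleast_1d(estimator(data[rng.integers(0, n, n)])) for _ in range(B)]
    )
    centered = estimates - estimates.mean(axis=0)
    # Sum of outer products of centered estimates, divided by B - 1.
    return centered.T @ centered / (B - 1)
```

In practice the `estimator` argument would be the full two-step sieve procedure applied to a resampled data set; here it can be any function mapping a data array to a parameter vector, which keeps the sketch independent of a particular optimizer.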
Keywords/Search Tags: Dependent censoring, Case I interval-censored data, Regression analysis, Copula model, Proportional hazards model, Additive hazards model, Linear transformation models