
Regression Analysis Of Interval-censored Data And Doubly-censored Data

Posted on: 2016-10-26    Degree: Doctor    Type: Dissertation
Country: China    Candidate: P J Wang    Full Text: PDF
GTID: 1220330473961750    Subject: Probability theory and mathematical statistics

Abstract/Summary:
In recent years the analysis of interval-censored failure time data has attracted a great deal of attention (Finkelstein, 1986; Klein and Moeschberger, 2003; Sun, 2006; Deng and Fang, 2009). In this case, the failure time of interest is observed only to belong to an interval or window rather than being observed exactly or right-censored (Fang et al., 2011; Kalbfleisch and Prentice, 2002). One field that often produces interval-censored data is medical follow-up studies such as clinical trials. In this situation, study subjects are usually given a set of prespecified clinical or observation times for checking the status or occurrence of a certain disease or medical condition. The analysis of the resulting data would be relatively straightforward if all subjects followed the prespecified observation times. However, it is well known that this rarely happens; study subjects often miss some observation times or use different observation times, thus yielding interval-censored data on the occurrence time of the disease or medical condition. Other fields that often produce interval-censored data include demographical studies, economic and financial studies, epidemiological studies, the social sciences and tumorigenicity experiments.

In practice, there exist several different types of interval-censored data. One type that is common in demographical studies and tumorigenicity experiments, among others, is case I interval-censored data (Andersen and Ronn, 1995; Huang, 1996; Lin et al., 1998; Martinussen and Scheike, 2002; Jewell and van der Laan, 2004a, 2004b; Sun, 2006). In this case, each subject is observed only once and the failure time of interest is known only to be smaller or greater than the observation time. In other words, the failure time is either left- or right-censored, and one only observes whether the event of interest has occurred before the observation time. To give an example, consider a tumorigenicity experiment concerning the occurrence rate of a tumor. In this setting, the observation time is usually the time at which an animal dies or is sacrificed, and the failure time of interest is the tumor onset time. Case I interval-censored data commonly occur here since, in general, the occurrence of a tumor can be observed only at death or sacrifice (Hoel and Walburg, 1972).

In addition to censoring, truncation is another special feature of failure time data, and its existence clearly makes the analysis of the data more complicated. In the following, we focus on case I interval-censored data in the presence of left-truncation, which occurs if a subject has to satisfy certain conditions or experience some initial or preliminary event to be included in a study. Several authors have discussed the analysis of left-truncated and case I interval-censored data (Joly et al., 1998; Pan and Chappell, 1998, 1999, 2002; Kim, 2003; Shen, 2014). In particular, Kim (2003) and Pan and Chappell (2002) considered regression analysis of such data under the proportional hazards model. It is well known that the proportional hazards model may not always fit the data well or be appropriate, and the additive hazards model provides a useful alternative (Lin et al., 1998; Kulich and Lin, 2000; Zhou and Sun, 2003). In the following, we consider regression analysis of left-truncated and case I interval-censored (LTIC-I) data arising from the additive hazards model.

Consider a failure time study that involves n independent subjects.
For subject i, let Ti denote the failure time of interest and suppose that there is a vector of covariates denoted by Zi, i = 1, ..., n. Also suppose that for each subject there is a left-truncation time Xi such that Xi ≤ Ti, and that the subject is observed only at a single time point Ui (Ui > Xi). As a result, one only knows whether Ti ≤ Ui or Ti > Ui. Define δi = I(Ti ≤ Ui), i = 1, ..., n. Then the observed data have the form {(Xi, Ui, δi, Zi); i = 1, ..., n}. In the following, suppose that the main objective is to make inference about the effects of the Zi's on the Ti's.

To describe the covariate effects, we assume that given Zi, the hazard function of Ti has the form

λ(t | Zi) = λ0(t) + β′Zi,

where λ0(t) denotes the unknown baseline hazard function and β is a vector of regression parameters. That is, the Ti's follow the additive hazards model (Lin and Ying, 1994). Define Λ0(t) = ∫0^t λ0(s) ds, the baseline cumulative hazard function, and S0(t) = exp{−Λ0(t)}, the baseline survival function. Assume that given Zi, Xi and Ui are independent of Ti. Then, conditioning on Ti > Xi, the likelihood function of β and Λ0 is proportional to

Ln(β, Λ0) = ∏i [1 − exp{−(Λ0(Ui) − Λ0(Xi)) − β′Zi(Ui − Xi)}]^δi [exp{−(Λ0(Ui) − Λ0(Xi)) − β′Zi(Ui − Xi)}]^(1−δi),

and the corresponding log-likelihood function ln(β, Λ0) is obtained by taking logarithms; it involves Λ0 only through the baseline cumulative hazard function Λ0(t).

To estimate β and Λ0, a natural approach is to maximize the log-likelihood function ln directly. On the other hand, it is well known that this maximization is not easy or straightforward because of the dimension of Λ0(t). To deal with this, following Huang and Rossini (1997) and others, we employ the sieve approach, which approximates the baseline cumulative hazard function Λ0(t) by piecewise-linear functions.

Let 0 = t0 < t1 < ... < tqn = T denote a partition of the observation interval [0, T], where T denotes the largest follow-up time. Here qn, often called the sieve number, is usually set to be an integer that increases with n at the rate O(n^κ) with 0 < κ < 1/2. Define Hn to be the set of all piecewise-linear functions of the form

Λn(t) = Σl=1^qn { hl−1 + (hl − hl−1)(t − tl−1)/(tl − tl−1) } Il(t)

with Λn(t) ≤ M for 0 ≤ t ≤ T. Here Il(t) = I(tl−1 < t ≤ tl), M is a constant, and 0 = h0 ≤ h1 ≤ h2 ≤ ... ≤ hqn ≤ M are unknown parameters. It is easy to see that Λn(tl) = hl, l = 0, 1, ..., qn. Following the sieve approach, we can estimate β and Λ0 by maximizing ln(β, Λ0) over β and the hl's on Θn = B × Hn, where B denotes the parameter space for β. In practice, however, it is more convenient to reparameterize the hl's as hl = Σk=1^l exp(γk) to remove the range restriction, where γ = (γ1, ..., γqn)′ are unknown parameters. With respect to γ, Λn(t) can be rewritten on (tl−1, tl] as Λn(t) = Σk=1^(l−1) exp(γk) + exp(γl)(t − tl−1)/(tl − tl−1).

With the use of the parameters γl, we propose to estimate θ = (β, Λ0) by θn = (βn, Λn), defined as the maximizer of ln(β, Λn) over Θn. For the determination of θn, a natural method is to solve the score equations ∂ln(β, Λn)/∂β = 0 and ∂ln(β, Λn)/∂γl = 0, l = 1, ..., qn.
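To make the estimation step concrete, the following is a minimal Python sketch (not code from the thesis) of the sieve procedure just described: Λn(t) is the piecewise-linear interpolant of the knot values hl = Σk=1^l exp(γk), and (β, γ) are obtained by maximizing the conditional log-likelihood for left-truncated current-status data. The data arrays, the quantile-based knot placement, the clipping of the cumulative-hazard increment and the use of a generic BFGS optimizer are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np
from scipy.optimize import minimize

def sieve_cumhaz(t, knots, gamma):
    """Piecewise-linear sieve approximation Lambda_n(t).

    knots = (t_0 = 0, t_1, ..., t_qn); the value at knot t_l is
    h_l = sum_{k<=l} exp(gamma_k), so Lambda_n is automatically nondecreasing.
    """
    levels = np.concatenate([[0.0], np.cumsum(np.exp(gamma))])
    return np.interp(t, knots, levels)

def neg_loglik_ltic1(par, x, u, delta, z, knots):
    """Negative log-likelihood for left-truncated current-status (case I) data
    under the additive hazards model lambda(t | Z) = lambda_0(t) + beta'Z,
    with the conditional contribution given T > X:
        P(T <= U | T > X, Z) = 1 - exp{-(Lambda_0(U) - Lambda_0(X)) - beta'Z (U - X)}.
    """
    p = z.shape[1]
    beta, gamma = par[:p], par[p:]
    inc = (sieve_cumhaz(u, knots, gamma) - sieve_cumhaz(x, knots, gamma)
           + (z @ beta) * (u - x))            # cumulative-hazard increment on (X, U]
    inc = np.clip(inc, 1e-10, None)           # numerical safeguard: keep the increment positive
    surv = np.exp(-inc)                       # P(T > U | T > X, Z)
    return -np.sum(delta * np.log1p(-surv) + (1.0 - delta) * (-inc))

def fit_ltic1(x, u, delta, z, qn):
    """Sieve maximum likelihood estimate of (beta, gamma)."""
    knots = np.quantile(np.concatenate([x, u]), np.linspace(0.0, 1.0, qn + 1))
    knots[0] = 0.0                            # partition of [0, T], T = largest follow-up time
    p = z.shape[1]
    start = np.concatenate([np.zeros(p), np.full(qn, -1.0)])
    fit = minimize(neg_loglik_ltic1, start, args=(x, u, delta, z, knots), method="BFGS")
    return fit.x[:p], fit.x[p:], knots
```

Here qn plays the role of the sieve number and would be chosen of order n^κ with 0 < κ < 1/2, as in the text.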
To establish the asymptotic properties of θn, let V = (X, U), Γ(v) = Λ(u) − Λ(x), Γn(v) = Λn(u) − Λn(x), and let G(v) denote the joint distribution function of V. Also define the distance d on Θ by d(θ1, θ2) = { |β1 − β2|² + ∫ [Γ1(v) − Γ2(v)]² dG(v) }^(1/2). In the following, we first give the efficient score function and information matrix for β and then present the results on convergence and asymptotic normality.

Theorem 1. Under some regularity conditions, (a) the efficient score function lβ* for β exists in closed form, and (b) the information matrix for β has the form I(β) = E{lβ*(Y)^⊗2}, where a^⊗2 = aa′ for a ∈ R^d.

Theorem 2 (Consistency). Under some regularity conditions, d(θn, θ0) → 0 in probability, where θ0 denotes the true value of θ.

Theorem 3 (Rate of convergence). Under some regularity conditions, d(θn, θ0) = Op(n^(−min(κr, (1−κ)/2))), where r is defined in the regularity conditions.

Theorem 4 (Asymptotic normality and efficiency). Suppose that the true value β0 of β is an interior point of B and that some regularity conditions hold. Also suppose that 1/(4r) < κ < 1/2. Then √n(βn − β0) = √n Pn{I^(−1)(β0) lβ*} + op(1) → N(0, I^(−1)(β0)) in distribution, where Pn denotes the empirical measure of {Yi = (δi, Xi, Ui, Zi); i = 1, ..., n}.

Note that Theorems 2 and 3 say that the estimator θn is not only consistent but can also be made rate-optimal by taking κ = 1/(1 + 2r); in this case the convergence rate is n^(r/(1+2r)), equal to n^(1/3) or n^(2/5) for r = 1 or 2, respectively. Theorem 4 indicates that the estimator βn is also efficient and that one could estimate its covariance matrix by using the information matrix. In practice, we propose to estimate the covariance matrix of βn by the upper-left submatrix of Σn^(−1) corresponding to β, where Σn denotes the negative Hessian matrix of n^(−1) ln(β, Λ) evaluated at (βn, Λn).

Above we considered regression analysis of left-truncated and case I interval-censored (LTIC-I) data arising from the additive hazards model. In practice, there exists another type of interval-censored data, called case II interval-censored data. Different from case I interval-censored data, there is more than one observation time for each subject. Many authors have investigated regression analysis of interval-censored failure time data; for relatively complete and recent references, the reader is referred to Sun (2006) and Chen et al. (2012). In the following, we consider regression analysis of left-truncated and case II interval-censored (LTIC-II) data arising from the additive hazards model.

Consider a failure time study that involves n independent subjects and is subject to both left-truncation and interval censoring. For subject i, let Ti denote the failure time of interest and suppose that there is a vector of covariates denoted by Zi, i = 1, ..., n. Also let Xi denote the left-truncation time and Ui and Vi the interval-censoring observation times associated with subject i, such that Xi < Ui < Vi and one only knows whether Xi < Ti ≤ Ui, Ui < Ti ≤ Vi, or Ti > Vi. Define δ1i = I(Ti ≤ Ui), δ2i = I(Ui < Ti ≤ Vi) and δ3i = I(Ti > Vi), i = 1, ..., n. Then the observed data have the form {(Xi, Ui, Vi, δ1i, δ2i, δ3i, Zi); i = 1, ..., n}. In the following, we suppose that the main objective is to make inference about the effects of the Zi's on the Ti's.

To describe the covariate effects, we will assume that given Zi, the hazard function of Ti follows the additive hazards model λ(t | Zi) = λ0(t) + β′Zi, where λ0(t) denotes an unknown baseline hazard function and β is a vector of regression parameters. Define Λ0(t) = ∫0^t λ0(s) ds, the baseline cumulative hazard function, and S0(t) = exp{−Λ0(t)}, the baseline survival function. Then the survival function of Ti has the form S(t | Zi) = S0(t) exp{−β′Zi t}.

In the following, we will assume that given Zi, (Xi, Ui, Vi) are independent of Ti. Then the conditional likelihood function of β and Λ0 given Ti ≥ Xi can be written as

Ln(β, Λ0) = ∏i [1 − S(Ui | Zi)/S(Xi | Zi)]^δ1i [{S(Ui | Zi) − S(Vi | Zi)}/S(Xi | Zi)]^δ2i [S(Vi | Zi)/S(Xi | Zi)]^δ3i,

and the resulting log-likelihood function ln(β, Λ0) follows by taking logarithms.

In the following, we will focus on inference about the regression parameter β. Similar to the method above, we propose to employ the sieve approach, which approximates Λ0(t) by piecewise-linear functions. Let 0 = t0 < t1 < ... < tqn = τ denote a partition of the observation interval [0, τ], where τ denotes the largest follow-up time. Here qn is usually called the sieve number and is set to be an integer that increases with n at the rate O(n^κ) with 0 < κ < 1/2. Define Hn to be the set of all piecewise-linear functions Λn with knots t1, ..., tqn, knot values Λn(tl) = Σk=1^l exp(γk), and Λn(t) ≤ M for 0 ≤ t ≤ τ. Here Il(t) = I(tl−1 < t ≤ tl), M is a constant, and the γl's are unknown parameters.
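For LTIC-II data, the conditional likelihood given Ti > Xi splits each subject's contribution into three parts according to (δ1i, δ2i, δ3i). The sketch below evaluates the corresponding negative log-likelihood under the same additive hazards survival function, reusing the sieve_cumhaz helper from the previous sketch; the clipping constants are numerical safeguards only and are not part of the method.

```python
import numpy as np

def neg_loglik_ltic2(par, x, u, v, d1, d2, d3, z, knots):
    """Negative conditional log-likelihood for left-truncated case II
    interval-censored data under the additive hazards model, assuming
    S(t | Z) = exp{-Lambda_0(t) - beta'Z t} and conditioning on T > X, so that
    each subject contributes
        [1 - S(U)/S(X)]^d1 * [(S(U) - S(V))/S(X)]^d2 * [S(V)/S(X)]^d3.
    Reuses sieve_cumhaz() from the previous sketch.
    """
    p = z.shape[1]
    beta, gamma = par[:p], par[p:]

    def cond_surv(t):
        # S(t | Z) / S(X | Z) under the additive hazards model
        inc = (sieve_cumhaz(t, knots, gamma) - sieve_cumhaz(x, knots, gamma)
               + (z @ beta) * (t - x))
        return np.exp(-np.clip(inc, 1e-10, None))

    su, sv = cond_surv(u), cond_surv(v)
    eps = 1e-12                               # numerical safeguard only
    return -np.sum(d1 * np.log(np.clip(1.0 - su, eps, None))
                   + d2 * np.log(np.clip(su - sv, eps, None))
                   + d3 * np.log(np.clip(sv, eps, None)))
```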
Focusing on the space Θn = B × Hn, where B denotes the parameter space for β, we obtain a finite-dimensional estimation problem in place of the original one. Also, it can easily be shown that as n → ∞, Θn converges to the original parameter space Θ, and thus Θn can be used as a sieve space. For estimation of β and Λ0, we propose to use the estimator θn = (βn, Λn) defined as the maximizer of ln(β, Λn) over Θn. For the determination of θn, a natural method is to solve the score equations ∂ln(β, Λn)/∂β = 0 and ∂ln(β, Λn)/∂γl = 0, l = 1, ..., qn.

In the following, we establish the consistency of θn and the asymptotic normality and efficiency of βn. For this, let G(x, u, v) denote the joint distribution function of (Xi, Ui, Vi) and define a distance d on Θ analogous to the one used for LTIC-I data. All limits below are taken as n → ∞.

Theorem 5. Suppose that the regularity conditions hold. Then it can be shown that the information matrix I(β) for β is a positive definite matrix with finite entries.

Theorem 6 (Consistency and rate of convergence). Suppose that the regularity conditions hold. Then d(θn, θ0) → 0 in probability. Furthermore, it can be shown that d(θn, θ0) = Op(n^(−min(κr, (1−κ)/2))) with 0 < κ < 1/2, where r is defined in the regularity conditions and β0 is the true value of β.

Theorem 7 (Asymptotic normality and efficiency). Suppose that β0 is an interior point of B and that 1/(4r) < κ < 1/2. Also assume that the regularity conditions hold. Then √n(βn − β0) = √n Pn{I^(−1)(β0) lβ*} + op(1) → N(0, I^(−1)(β0)) in distribution, where Pn denotes the empirical measure of {Yi = (δ1i, δ2i, Xi, Ui, Vi, Zi); i = 1, ..., n} and lβ* is the efficient score function for β.

Note that Theorem 6 says that the proposed estimator θn is not only consistent but can also be made rate-optimal by setting κ = 1/(1 + 2r), which gives the convergence rate n^(r/(1+2r)), equal to n^(1/3) or n^(2/5) for r = 1 or 2, respectively. Theorem 7 tells us that one can approximate the distribution of βn by a normal distribution and that its asymptotic covariance attains the semiparametric lower bound; that is, βn is efficient.

To make inference about β based on the results above, one needs to estimate the information matrix I(β). For this, we suggest employing the profile likelihood method (Murphy and van der Vaart, 1999). Specifically, let pln(β) = ln(β, Λβ) denote the profile log-likelihood for β, where Λβ denotes the Λ that maximizes ln(β, Λ) for given β. Also let h1, ..., hd denote random step sizes and ei the d-dimensional vector of zeros except for the ith element, which equals one, where d denotes the dimension of β. For i, j = 1, ..., d, define

Iij = −{pln(βn + hiei + hjej) − pln(βn + hiei) − pln(βn + hjej) + pln(βn)}/(n hi hj).

Then one can show that if hi →p 0 and both hi/hj and (√n hi)^(−1) are bounded in probability, Iij converges in probability to the (i, j)th component of the information matrix I(β). In other words, I(β) can be consistently estimated by I = (Iij).
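The profile-likelihood variance estimator described above can be computed by numerical second differencing of pln(β), following Murphy and van der Vaart (1999). The sketch below assumes a hypothetical helper profile_loglik(beta) that returns ln(β, Λβ), that is, the log-likelihood profiled over the sieve for fixed β; the default choice of step sizes is also an assumption.

```python
import numpy as np

def profile_information(profile_loglik, beta_hat, n, h=None):
    """Estimate I(beta) by second differences of the profile log-likelihood
    pl_n(beta), in the spirit of Murphy and van der Vaart (1999).

    profile_loglik : callable returning pl_n(beta) = l_n(beta, Lambda_beta),
                     i.e. l_n maximized over the sieve for fixed beta
                     (a hypothetical helper, not shown here).
    h              : step sizes h_1, ..., h_d; the default n^{-1/2} keeps
                     (sqrt(n) h_i)^{-1} bounded, as required in the text.
    """
    d = len(beta_hat)
    if h is None:
        h = np.full(d, n ** -0.5)
    e = np.eye(d)
    pl0 = profile_loglik(beta_hat)
    info = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            # second-difference approximation to the (i, j) entry, scaled by n
            info[i, j] = -(profile_loglik(beta_hat + h[i] * e[i] + h[j] * e[j])
                           - profile_loglik(beta_hat + h[i] * e[i])
                           - profile_loglik(beta_hat + h[j] * e[j])
                           + pl0) / (n * h[i] * h[j])
    return info
```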
Doubly censored failure time data often arise in many areas, especially in disease progression or epidemiological studies (De Gruttola and Lagakos, 1989; Kim et al., 1993; Sun et al., 1999; Pan, 2001; Sun et al., 2004). By doubly censored data, we mean that the failure time of interest represents the elapsed time between two related events, an initial event and a subsequent event, and that the observations on the occurrences of both events can suffer censoring such as right censoring or interval censoring. It is apparent that if there is no censoring on the occurrence time of the initial event, the observed data reduce to the usual right-censored or interval-censored data (Kalbfleisch and Prentice, 2002; Sun, 2006).

A well-known example of doubly censored failure time data is given by an acquired immune deficiency syndrome (AIDS) cohort study (Sun, 2006). The study consists of patients with hemophilia who were at risk of type 1 human immunodeficiency virus (HIV-1) infection due to the contaminated blood factor that they received for their treatment. The variable of interest is the time from HIV-1 infection to the diagnosis of AIDS, often referred to as the AIDS incubation time. Since the subjects were observed only periodically, the observations on the HIV-1 infection times and the AIDS diagnosis times suffer interval censoring and right censoring, respectively. In addition, only a small number of the patients developed AIDS during the study. As usual, in the presence of heavy right censoring, a natural question is the possible existence of a cured subgroup.

Standard failure time analysis usually assumes that all study subjects will experience, or are susceptible to, the event of interest. In practice, however, this clearly may not be the case, as some subjects may never experience the event or may not experience it within a long period of time. In other words, there may exist subjects who are not susceptible to the event of interest; that is, there may exist a cured subgroup. To deal with this, a cure model is often employed. One of the early references is Farewell (1986), which proposed a logistic model to describe the cure rate and discussed parametric analysis of the model. Following Farewell (1986), a number of other authors investigated the same problem under different set-ups (Lu and Ying, 2004; Fang et al., 2005; Lam and Xue, 2005; Ma, 2009, 2010). However, all of these methods are for either right-censored or interval-censored data. In the following, we discuss regression analysis of doubly censored data in the presence of a possible cured subgroup.

Consider a failure time study that involves n independent subjects and two related events, an initial event and a subsequent event. For subject i, let Xi and Si denote the occurrence times of the initial and subsequent events, respectively, with Xi ≤ Si, and suppose that there exists a vector of covariates denoted by Zi, i = 1, ..., n. Define Ti = Si − Xi, the elapsed time between the two events and the failure time of interest, and suppose that for the Ti's one only observes doubly censored data given by {Wi = (Li, Ri, Ui, Vi, δ1i, δ2i, Zi); i = 1, ..., n}. In the above, Li and Ri represent the observed interval for Xi such that Li < Xi ≤ Ri, Ui and Vi denote two observation times on Si, and δ1i = I(Si ≤ Ui), δ2i = I(Ui < Si ≤ Vi), with Li ≤ Ri ≤ Ui ≤ Vi. It is easy to see that if Li = Ri, the occurrence time of the initial event is known exactly and we have interval-censored data on the Ti's. In the following, we will assume that given Zi, Xi and Ti are independent, and also that the censoring mechanism described by Li, Ri, Ui and Vi is noninformative (Sun, 2006).

To describe the possible cured subgroup, for subject i define the indicator variable Yi = 0 if subject i is cured and Yi = 1 otherwise. It is apparent that given Yi = 0, we have δ1i = δ2i = 0. For a subject with Yi = 1, we assume in the following that the hazard function of Ti has the form

λ(t | Zi, Yi = 1) = λ0(t) exp(θ′Zi),   (13)

where λ0(t) is an unknown baseline hazard function and θ denotes the vector of regression coefficients. That is, Ti follows the proportional hazards model. To complete the specification of the cure model, it will be supposed that the binary variable Yi can be modeled by the logistic model

P(Yi = 1 | Zi) = exp(α + β′Zi)/{1 + exp(α + β′Zi)},   (14)

where α is an unknown intercept and β denotes a vector of regression parameters.
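Models (13) and (14) together imply a mixture form for the survival function of Ti: cured subjects never fail, while susceptible subjects follow the proportional hazards model. The small sketch below evaluates this implied survival function; it assumes the logistic model is placed on P(Yi = 1 | Zi), and the Weibull-type baseline cumulative hazard used in the example call is made up purely for illustration.

```python
import numpy as np

def cure_model_survival(t, z, alpha, beta, theta, cum_hazard0):
    """Population survival function P(T > t | Z) implied by the mixture cure model:
    a logistic model for the susceptibility indicator Y and a proportional
    hazards model for the susceptible subgroup.  Cured subjects (Y = 0) never
    experience the event, so their survival contribution is identically 1.
    """
    eta = alpha + z @ beta
    p_susceptible = 1.0 / (1.0 + np.exp(-eta))                   # P(Y = 1 | Z), logistic model (14)
    s_susceptible = np.exp(-cum_hazard0(t) * np.exp(z @ theta))  # PH survival for Y = 1, model (13)
    return (1.0 - p_susceptible) + p_susceptible * s_susceptible

# Illustrative call with a made-up baseline cumulative hazard Lambda_0(t) = t^1.5
z = np.array([0.5, -1.0])
print(cure_model_survival(2.0, z, alpha=0.2, beta=np.array([0.4, 0.1]),
                          theta=np.array([-0.3, 0.6]), cum_hazard0=lambda t: t ** 1.5))
```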
Note that for simplicity we assume that the covariates that may have effects on Ti and Yi are the same. In practice, they may be completely different or only partially overlap, and in this case the proposed approach below remains valid. Under models (13) and (14), it is easy to see that, given Zi, the survival function of Ti has the form

S(t | Zi) = 1 − P(Yi = 1 | Zi) + P(Yi = 1 | Zi) exp{−Λ0(t) exp(θ′Zi)},

where Λ0(t) = ∫0^t λ0(s) ds, the cumulative baseline hazard function.

Now we consider estimation of the regression parameters α, β and θ as well as the baseline cumulative hazard function Λ0(t). For this, note that, as mentioned above, if the occurrence times x = (x1, ..., xn) of the initial event were observed exactly, then one would have interval-censored data on the Ti's. In this case, a common estimation method is to maximize the log-likelihood function ln(α, β, θ, Λ0 | x) = log Ln(α, β, θ, Λ0 | x), where

Ln(α, β, θ, Λ0 | x) = ∏i [1 − S(Ui − xi | Zi)]^δ1i [S(Ui − xi | Zi) − S(Vi − xi | Zi)]^δ2i [S(Vi − xi | Zi)]^(1−δ1i−δ2i)

(Finkelstein, 1986; Sun, 2006).

In practice, of course, we do not observe the Xi's exactly. Let Ĥ denote the nonparametric maximum likelihood estimator (NPMLE) of the cumulative distribution function H of the Xi's based on the interval-censored data on the Xi's only. Then, using the profile likelihood idea and following Sun et al. (1999) and Sun et al. (2004), one can estimate the parameters by maximizing an approximate log-likelihood function ln(α, β, θ, Λ0, Ĥ), in which the unobserved Xi's are integrated out with respect to Ĥ over (Li, Ri], where ai = ∫Li^Ri dĤ(x), i = 1, ..., n. Note that here, for simplicity, we have assumed that the Xi's follow the same distribution.

It is easy to see that the maximization of ln(α, β, θ, Λ0, Ĥ) may be difficult or not straightforward due to the dimension of Λ0(t). To deal with this, following Huang and Rossini (1997) and others, we propose to employ the sieve approach, which approximates Λ0(t) by piecewise-linear functions. More specifically, let 0 = t0 < t1 < ... < tqn = T denote a partition of the observation interval [0, T], where T denotes the largest follow-up time and qn, an integer increasing with n, is usually called the sieve number. Define the piecewise-linear functions Λn with Λn(t) ≤ M for 0 ≤ t ≤ T, where Il(t) = I(tl−1 < t ≤ tl), M is a constant, and γ1, ..., γqn are unknown parameters. It is easy to see that Λn(tl) = Σk=1^l exp(γk), l = 1, ..., qn. We define the estimators of α, β, θ and Λ0(t) as the values of α, β, θ and the γl's that maximize ln(α, β, θ, Λn, Ĥ) over α, β, θ and Λn(t).

Let αn, βn, θn and Λn denote the estimators of α, β, θ and Λ0(t) defined above, respectively. We will assume that Ĥ is a consistent estimator of H and also, for simplicity, that Ĥ has a finite number of support points, which is usually the case in clinical trials or follow-up studies (Sun et al., 1999).

Theorem 8. Under the regularity conditions, the estimators are consistent, and furthermore √n(ψn − ψ0) → N(0, I^(−1)(ψ0)) in distribution. Here ψn = (αn, βn′, θn′)′, ψ0 = (α0, β0′, θ0′)′ denotes the true value of ψ = (α, β′, θ′)′, and I(ψ0) is the information matrix.

For the determination of ψn and Λn, the direct maximization of ln(α, β, θ, Λn, Ĥ) can be complicated. To deal with this, following Pan (2001), we propose to employ the multiple imputation approach described below. Let D be an integer. For each l (1 ≤ l ≤ D), let xl = (xl1, ..., xln) be a random sample of size n generated from Ĥ given Xi ∈ (Li, Ri], and let ψ(l) and Λ(l) denote the resulting estimators obtained by maximizing ln(α, β, θ, Λn | xl). Then one can obtain the proposed estimators ψn and Λn by averaging the ψ(l)'s and Λ(l)'s over l, and estimate the covariance matrix of ψn by combining the within-imputation covariance estimates I(l) with the between-imputation variation of the ψ(l)'s. Here I(l) denotes the covariance estimate of ψ(l) obtained through the negative second-derivative matrix of ln(α, β, θ, Λn | xl). It can easily be shown that, under suitable conditions and as D → ∞, the above estimators converge to the true parameters (Wang and Robins, 1998).
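The multiple imputation step can be sketched generically as follows. Both draw_x, which samples each Xi from Ĥ restricted to (Li, Ri], and fit_given_x, which maximizes ln(α, β, θ, Λn | x) on a completed data set and returns the point estimate with its covariance estimate, are hypothetical placeholders, and the Rubin-type combining rule shown is a standard choice that may differ in detail from the one used in the thesis.

```python
import numpy as np

def multiple_imputation(draw_x, fit_given_x, D):
    """Generic multiple-imputation estimator in the spirit of Pan (2001).

    draw_x()       : returns one imputed vector (x_1, ..., x_n), each x_i drawn
                     from H-hat restricted to (L_i, R_i]   (hypothetical sampler)
    fit_given_x(x) : returns (psi_hat, cov_hat) from maximizing
                     l_n(alpha, beta, theta, Lambda_n | x)  (hypothetical fitter)
    """
    fits = [fit_given_x(draw_x()) for _ in range(D)]
    psis = np.array([f[0] for f in fits])        # D x d matrix of point estimates
    covs = np.array([f[1] for f in fits])        # D within-imputation covariance estimates
    psi_bar = psis.mean(axis=0)                  # averaged point estimate
    within = covs.mean(axis=0)                   # average within-imputation covariance
    dev = psis - psi_bar
    between = dev.T @ dev / (D - 1)              # between-imputation variability
    return psi_bar, within + (1.0 + 1.0 / D) * between   # Rubin-type combination
```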
Keywords/Search Tags: Additive hazards model, Interval-censored data, Left-truncation, Sieve maximum likelihood estimation, Cure model, Doubly censored data, Multiple imputation, Proportional hazards model