
Statistical Inference for a Class of Integer-Valued Time Series with Missing Data

Posted on: 2015-03-09    Degree: Doctor    Type: Dissertation
Country: China    Candidate: B T Jia    Full Text: PDF
GTID: 1260330428983117    Subject: Probability theory and mathematical statistics
Abstract/Summary:
The data in time series analysis are mainly obtained through researchers' observations or recording instruments. For a variety of reasons, it is common to encounter a data set containing a considerable number of missing observations. For example, missing data may occur in market research due to errors in data entry; in the monitoring of atmospheric pollution, missing data appear when a monitoring instrument is out of action; the unequal spacing of stock index data resulting from public holidays can also be regarded as missing data. If ignored, missing data may affect the accuracy of estimates and reduce statistical power in some situations. Therefore, it is necessary to study the missing data problem in time series analysis.

In this thesis, we study the missing data problem in non-negative integer-valued time series. The thesis is divided into three parts. In the first part, we estimate the parameters of the periodic integer-valued autoregressive process of order one with period T (PINAR(1)_T) in the presence of missing data. We use five methods to estimate the parameters of this process, and their performances are compared via simulations. In the second part, in order to handle integer-valued time series data varying with the season, we introduce a new periodic integer-valued time series process. Some basic probabilistic and statistical properties of this process are discussed; moreover, parameter estimation with both complete data and missing data is addressed.
In the third part, we introduce two nonparametric methods to impute missing values in a stationary non-negative integer-valued time series, and their performances are compared via simulations and a real example. We now summarize the main results of this thesis.

The first part is devoted to estimating the parameters of the PINAR(1)_T process with missing data.

In practice, non-negative integer-valued time series data varying with the season are fairly common, such as the monthly number of short-term unemployed people. The periodic integer-valued autoregressive process of order one with period T (PINAR(1)_T) proposed by Monteiro et al. (2010) can be used to handle this type of series. In the first part, we investigate the problem of estimating the parameters of this process with missing data.

The PINAR(1)_T process is defined by the recursive equation

X_t = φ_t ∘ X_{t−1} + Z_t,  (1)

where
(i) the binomial thinning operator "∘" is defined as φ_t ∘ X_{t−1} = Σ_{i=1}^{X_{t−1}} U_i(t), with φ_t = α_j ∈ (0, 1) for t = j + κT, 1 ≤ j ≤ T, κ ∈ N_0, and {U_i(t)} is a periodic sequence of independent Bernoulli random variables with success probability P(U_i(t) = 1) = φ_t;
(ii) {Z_t} constitutes a periodic sequence of independent Poisson-distributed random variables with mean υ_t, where υ_t = λ_j for t = j + κT, 1 ≤ j ≤ T, κ ∈ N_0, and Z_t is independent of X_{t−1} and φ_t ∘ X_{t−1}.

It is assumed that the time series data {X_i : i ∈ I} come from process (1), where I = {1, …, NT} and N is a positive integer. For various reasons, we only observe the partial data points {X_s : s ∈ S ⊆ I}; let S = {s_1, …, s_d}, where s_i denotes the subscript of an observed data point and s_1 < … < s_d. The points {X_r : r ∈ R = I \ S} are missing completely at random. Under these assumptions, we use five methods to estimate the parameters θ = (α_1, λ_1, …, α_T, λ_T).

The first method is the conditional least squares method with no imputation. It estimates the parameters by minimizing the sum of squared errors Q(θ) = Σ_{i=2}^{d} [X_{s_i} − E(X_{s_i} | X_{s_{i−1}})]², where E(X_{s_i} | X_{s_{i−1}}) = β_{s_i, p_i} X_{s_{i−1}} + Σ_{j=0}^{p_i−1} β_{s_i, j} υ_{s_i−j}. Here p_i = s_i − s_{i−1} is the gap between consecutive observed points and β_{t,k} = φ_t φ_{t−1} ⋯ φ_{t−k+1}, with β_{t,0} = 1.

The second method is the conditional maximum likelihood method with no imputation.
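Both no-imputation methods condition on recursion (1). As a minimal illustration of that recursion (a sketch, not the thesis's code; the function names and parameter values are ours), a PINAR(1)_T path can be simulated via binomial thinning:

```python
import numpy as np

rng = np.random.default_rng(0)

def binomial_thinning(x, phi, rng):
    """phi o x: the sum of x i.i.d. Bernoulli(phi) variables."""
    return rng.binomial(x, phi)

def simulate_pinar1(alpha, lam, n_cycles, rng, x0=0):
    """Simulate X_t = phi_t o X_{t-1} + Z_t with period T = len(alpha),
    phi_t = alpha[j] and Z_t ~ Poisson(lam[j]) for season j = t mod T."""
    T = len(alpha)
    x = np.empty(n_cycles * T, dtype=np.int64)
    prev = x0
    for t in range(n_cycles * T):
        j = t % T  # 0-based season index
        prev = binomial_thinning(prev, alpha[j], rng) + rng.poisson(lam[j])
        x[t] = prev
    return x

x = simulate_pinar1(alpha=[0.3, 0.6], lam=[2.0, 1.0], n_cycles=500, rng=rng)
```

Dropping entries of `x` completely at random then reproduces the observed-data setting {X_s : s ∈ S}.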
This method estimates the parameters by maximizing the conditional likelihood CL(θ), built from the transition probabilities between consecutive observed points. Conditioning on X_{s_{i−1}} = x_{s_{i−1}}, the term β_{s_i, s_i−s_{i−1}} ∘ X_{s_{i−1}} is a binomial random variable with distribution B(x_{s_{i−1}}, β_{s_i, s_i−s_{i−1}}), while Σ_{j=0}^{p_i−1} β_{s_i, j} ∘ Z_{s_i−j} follows a Poisson distribution P(Σ_{j=0}^{p_i−1} β_{s_i, j} υ_{s_i−j}). This implies that the p_i-step-ahead conditional distribution is the convolution of a binomial variable and a Poisson variable.

The third method is subgroup mean imputation. The idea is as follows: according to the periodicity of the PINAR(1)_T process (1), it is easy to define subgroups in the time series {X_i : i ∈ I}; each missing value is then replaced by the rounded mean of the observations in its own subgroup.

The fourth method is imputation based on the likelihood, which relies on an iterative scheme. In the (i+1)-th iteration, we generate M groups of candidate imputed values by using the recursive equation (1) and the estimate θ_i derived in the i-th iteration; we calculate the conditional log-likelihood at θ = θ_i, and the series that provides the largest log-likelihood is treated as the complete data set; using the selected series, we derive θ_{i+1} and update the parameter values. This process is repeated until some convergence criterion is reached.

The fifth method is bridge imputation. It is suited to the case in which X_t and X_{t+k} are observed for some k > 1, but all the points between them are missing. This method is also based on an iterative scheme and imputes missing values according to the recursive equation (1); however, the principle for choosing the final imputed values differs from that of the fourth method. The algorithm proceeds as follows. In the (i+1)-th iteration, generate candidate values X̃_{t+1}, …, X̃_{t+k−1}, X̃_{t+k}; if the simulated value X̃_{t+k} coincides with the observed X_{t+k}, keep this path, otherwise generate a new one until the condition is satisfied; once an entire series has been simulated, we obtain a new θ_{i+1}.
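A single draw of the bridge step just described can be sketched as rejection sampling of the intermediate path (a simplified illustration; the function name and the retry cap are our assumptions):

```python
import numpy as np

def bridge_sample_gap(x_left, x_right, seasons, alpha, lam, rng, max_tries=100000):
    """Sketch of one bridge-imputation draw for recursion (1): simulate the
    points between observed X_t = x_left and X_{t+k} = x_right forward from
    the recursion, keeping only paths whose simulated endpoint hits x_right.
    seasons: 0-based season indices of t+1, ..., t+k."""
    for _ in range(max_tries):
        path, prev = [], x_left
        for j in seasons:
            prev = rng.binomial(prev, alpha[j]) + rng.poisson(lam[j])
            path.append(prev)
        if path[-1] == x_right:  # endpoint constraint satisfied: keep this path
            return path[:-1]     # the k-1 imputed in-between values
    raise RuntimeError("endpoint too unlikely under the current parameters")

rng = np.random.default_rng(1)
imputed = bridge_sample_gap(3, 2, seasons=[0, 0], alpha=[0.5], lam=[1.0], rng=rng)
```

In the thesis, this accept/reject draw is iterated together with re-estimation of θ.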
This simulate-and-accept process is repeated until some convergence criterion is reached.

We conduct a simulation experiment to compare the five methods. As far as bias and MSE are concerned, when the missing proportion is small, the conditional maximum likelihood method with no imputation and subgroup mean imputation are preferable; when the missing proportion is large, bridge imputation is the best choice.

The marginal distribution of the PINAR(1)_T process is a Poisson distribution with periodic parameters. Within the same period, the mean and the variance of the Poisson distribution are equal, and this property is not always found in real data. Thus we introduce the periodic geometric integer-valued autoregressive process of order one with period T (PNGINAR(1)_T) as follows.

Definition 1. The PNGINAR(1)_T process is a sequence of random variables {X_t} defined by the recursive equation

X_t = φ_t * X_{t−1} + ε_t,  (2)

where
(i) {X_t} has geometric marginal distribution G(ν_t/(1+ν_t)) with ν_t = μ_j for t = j + κT, 1 ≤ j ≤ T, κ ∈ N_0;
(ii) the negative binomial thinning operator "*" is defined as φ_t * X_{t−1} = Σ_{i=1}^{X_{t−1}} W_i(t), where {W_i(t)} is a periodic sequence of independent random variables with geometric distribution G(φ_t/(1+φ_t)) and φ_t = α_j ∈ (0, 1) for t = j + κT, 1 ≤ j ≤ T, κ ∈ N_0;
(iii) {ε_t} is a sequence of independent random variables such that, for all t, i, m ∈ N and 1 ≤ l ≤ t, ε_t and W_i(m) are independent, and ε_t and X_{t−l} are independent.

According to Definition 1, E(X_{j+κT}) = μ_j and Var(X_{j+κT}) = μ_j(1 + μ_j), 1 ≤ j ≤ T, κ ∈ N_0, and we can derive the distribution of the random variable ε_t.

Proposition 1. Let t = j + κT, 1 ≤ j ≤ T, κ ∈ N_0, with 0 < α_j < min{1, μ_j/(1+μ_{j−1})} and μ_0 = μ_T. Then the distribution of ε_t in (2) is a mixture of G(μ_j/(1+μ_j)) and G(α_j/(1+α_j)).

Now we consider the conditional moments and obtain the following result.

Proposition 2. Suppose X_t, X_{t+h}, X_{t−h} come from process (2), where t = j + κT, h = i + mT, 1 ≤ i, j ≤ T, m, κ ∈ N_0.
Let γ_t(h) = Cov(X_t, X_{t+h}) and γ_t(−h) = Cov(X_t, X_{t−h}). Then (i) the autocovariance functions of X_t and X_{t+h}, and of X_t and X_{t−h}, are given in closed form; (ii) the h-step-ahead conditional mean is given in closed form, and when h → +∞, E(X_{t+h} | X_t) → E(X_{j+i}); (iii) the h-step-ahead conditional variance is given in closed form, and when h → +∞, Var(X_{t+h} | X_t) → Var(X_{j+i}).

Furthermore, we discuss stationarity and ergodicity.

Theorem 1. If {X_t} satisfies (2), then, for each j with 1 ≤ j ≤ T, {X_{j+κT} : κ ∈ N_0} is an irreducible, aperiodic and positive recurrent (and hence ergodic) Markov chain, and the stationary distribution of {X_{j+κT} : κ ∈ N_0} is given by that of a series Y which converges almost surely and also in L².

Finally, we study the estimation of the PNGINAR(1)_T process. When the data are complete, we derive the conditional least squares (CLS), Yule-Walker (YW) and conditional maximum likelihood (CML) estimators; it is also proved that the limiting distributions of the CLS- and YW-estimators are normal. Suppose that {X_1, …, X_{NT}} comes from (2), where N ∈ N and X_0 = x_0.
where μ_0 = μ_T. The CLS-estimators θ̂_CLS are then obtained by minimizing the criterion function Q(θ) = Σ_t [X_t − E(X_t | X_{t−1})]².  (3)

Theorem 2. For the CLS-estimators θ̂_CLS in (3), when N → +∞, √N(θ̂_CLS − θ) converges in distribution to a 2T-dimensional normal law with mean 0 = (0, …, 0)′, a vector of dimension 2T × 1, and a covariance matrix expressed through matrices A_1 and A_2 whose blocks are given, for 1 ≤ i ≤ T, by
Ω_{i,11} = μ_{i−1}(1+μ_{i−1})(α_i²(1/μ_{i−1} − μ_{i−1}² + 1) + α_i(2/μ_{i−1} + 1) + μ_i(1+μ_{i−1})),
Ω_{i,22} = α_{i+1}²μ_{i+1}(1+μ_{i+1}) + (1−α_{i+1}⁴)μ_i(1+μ_i) − α_i²μ_{i−1}(1+μ_{i−1}),
Ω_{i,12} = α_iμ_{i−1}(1+α_i)(1+μ_{i−1}),
V_{i,21} = −α_i²μ_{i−1}(1+α_i)(1+μ_{i−1}),
V_{i,22} = α_i³μ_{i−1}(1+μ_{i−1}) − α_iμ_i(1+μ_i),
with μ_0 = μ_T and α_{T+1} = α_1.

Using μ_i = E(X_{i+κT}) and γ_i(1) = β_{i+1,1} Var(X_{i+κT}), we can derive the YW-estimators θ̂_YW.  (4)

Theorem 3. For the YW-estimators θ̂_YW in (4), when N → ∞, √N(θ̂_YW − θ) has the same limiting normal distribution, where 0, A_1, A_2 are defined as in Theorem 2.

The CML-estimators are obtained by maximizing the conditional log-likelihood function; thus they can be obtained by solving the corresponding score equations.

Finally, we employ the conditional least squares method with no imputation, subgroup mean imputation, imputation based on the likelihood and bridge imputation to estimate the parameters of the PNGINAR(1)_T process in the presence of missing data, and the performances of these four approaches are compared via simulations. We reach the following conclusions: in terms of bias and MSE, the conditional least squares method with no imputation and bridge imputation are better than the other two methods; as the missing probability varies, the conditional least squares method with no imputation and bridge imputation are more stable than the other two methods, and subgroup mean imputation is the most sensitive to the missing probability.

Note that the approaches in the above parts are proposed for specific models, whereas a nonparametric method is more reasonable for treating missing values in some real data sets. In the third part, we introduce two nonparametric methods to impute missing data in a non-negative integer-valued time series.
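As a concrete illustration of the second part's complete-data CLS step (a sketch, not the thesis's code): by (2), the season-j conditional mean is E(X_t | X_{t−1}) = α_j X_{t−1} + μ_j − α_j μ_{j−1}, so per-season ordinary least squares recovers α_j and the intercepts b_j, and the means μ_j follow from the cyclic system μ_j = b_j + α_j μ_{j−1}:

```python
import numpy as np

def cls_pngninar1(x, T):
    """CLS sketch for PNGINAR(1)_T with complete data.  For season j,
    E(X_t | X_{t-1}) = a_j X_{t-1} + b_j with a_j = alpha_j and
    b_j = mu_j - alpha_j mu_{j-1}, so minimizing the squared prediction
    error season by season is OLS; the mu_j are then recovered from the
    cyclic system mu_j = b_j + a_j mu_{j-1} with mu_0 = mu_T."""
    a, b = np.empty(T), np.empty(T)
    for j in range(T):
        t = np.arange(j if j > 0 else T, len(x), T)  # indices with a predecessor
        A = np.column_stack([x[t - 1], np.ones(len(t))])
        coef, *_ = np.linalg.lstsq(A, x[t], rcond=None)
        a[j], b[j] = coef
    # close the cycle: mu_{T-1} = c / (1 - prod(a)), c by forward substitution
    c = 0.0
    for j in range(T):
        c = b[j] + a[j] * c
    mu, prev = np.empty(T), c / (1.0 - np.prod(a))
    for j in range(T):
        mu[j] = b[j] + a[j] * prev
        prev = mu[j]
    return a, mu

# noise-free check: x_t = 0.5 x_{t-1} + 2 exactly (T = 1) gives alpha = 0.5, mu = 4
x = np.empty(12)
x[0] = 1.0
for t in range(1, 12):
    x[t] = 0.5 * x[t - 1] + 2.0
a_hat, mu_hat = cls_pngninar1(x, T=1)
```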
The two nonparametric methods depend only on the non-missing observations, so they have wide applicability in practice.

It is assumed that we have a non-negative integer-valued time series {X_t : t ∈ N} satisfying the following three conditions:
(a) {X_t} is a stationary time series;
(b) ρ(T) > 0 and ρ(T) is a monotonically decreasing function, where ρ(T) = Corr(X_t, X_{t+T}), T ∈ N_0;
(c) the observed data set is {X_s : s ∈ S ⊆ J}, where J = {1, …, n} and S = {s_1, …, s_d} with s_1 < … < s_d; the data {X_r : r ∈ R ⊆ J} are missing completely at random, where R = J \ S = {r_1, …, r_g} with r_1 < … < r_g.

Under these assumptions, we give two nonparametric methods to impute the missing data.

First, we introduce kernel imputation (KER). The idea of kernel imputation is that each missing value is estimated by the rounded weighted average of all the non-missing values. A given missing value X_{r_{i0}} is imputed by

X̂_{r_{i0}} = [Σ_{j=1}^{d} w_{d,j}(r_{i0}) X_{s_j}],

where [·] stands for rounding and {w_{d,j}(r_{i0}) : 1 ≤ j ≤ d} are kernel weight functions of the form w_{d,j}(r_{i0}) = K((r_{i0} − s_j)/h) / Σ_{l=1}^{d} K((r_{i0} − s_l)/h). Here K(·) is a kernel function and h is a positive bandwidth parameter. The optimal value h_opt of h is obtained by the cross-validation (CV) method.

Second, we introduce the k-nearest neighbor imputation (KNN).

Definition 2. Let the non-negative integer-valued time series {X_t : t ∈ N} satisfy conditions (a), (b) and (c).
For each r_{i0} ∈ R, there exists a permutation r_{i0}(1), …, r_{i0}(d) of s_1, …, s_d satisfying
(i) |r_{i0}(1) − r_{i0}| ≤ |r_{i0}(2) − r_{i0}| ≤ … ≤ |r_{i0}(d) − r_{i0}|;
(ii) if |r_{i0}(m) − r_{i0}| = |r_{i0}(l) − r_{i0}| and r_{i0}(m) < r_{i0}(l), then m < l.
X_{r_{i0}(j)} is called the j-th nearest neighbor of X_{r_{i0}} in {X_{s_1}, …, X_{s_d}}, and for each k with 1 ≤ k ≤ d, {X_{r_{i0}(1)}, …, X_{r_{i0}(k)}} is called the k-nearest neighbor set of X_{r_{i0}} in {X_{s_1}, …, X_{s_d}}.

The idea of k-nearest neighbor imputation is that each missing value is estimated by the rounded weighted average of its own k-nearest neighbor set. The k-nearest neighbor estimate of X_{r_{i0}} is given by

X̂_{r_{i0}} = [Σ_{i=1}^{k} w_{d,k}(i) X_{r_{i0}(i)}],

where {w_{d,k}(i) : 1 ≤ i ≤ d} is the k-nearest neighbor (k-NN) weight function, whose weights are non-negative and sum to one. The optimal value k_opt of k is obtained by the CV criterion.

Finally, we investigate the effect of imputation via a simulation study. We choose the Gaussian kernel for the KER method, and the three kinds of k-NN weight functions of Stone (1977) for the KNN method. From the simulation results, we can see that the imputation effect of the KER method and the KNN method is better than that of mean imputation, especially when the missing probability is small; under different sample sizes and missing probabilities, the KER method performs best; and among the three kinds of weight functions given by Stone (1977), the triangular k-NN weight function is the best choice for the KNN method.
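Both imputations reduce to rounded weighted averages of observed points. A compact sketch (our assumptions: a Gaussian kernel for KER, uniform weights w = 1/k for KNN with ties broken toward the smaller index as in Definition 2; all names are ours):

```python
import numpy as np

def kernel_impute(x, observed, h):
    """KER sketch: each missing value becomes the rounded kernel-weighted
    average of ALL observed values, with a Gaussian kernel and bandwidth h
    (h would be chosen by cross-validation in practice)."""
    x = x.astype(float).copy()
    s = np.flatnonzero(observed)
    for r in np.flatnonzero(~observed):
        w = np.exp(-0.5 * ((r - s) / h) ** 2)      # K((r - s_j) / h)
        x[r] = np.rint(np.dot(w / w.sum(), x[s]))  # rounded weighted average
    return x

def knn_impute(x, observed, k):
    """KNN sketch with uniform weights: each missing value becomes the
    rounded mean of its k nearest observed points in time."""
    x = x.astype(float).copy()
    s = np.flatnonzero(observed)
    for r in np.flatnonzero(~observed):
        # order observed indices by (distance, index): Definition 2's permutation
        nn = s[np.lexsort((s, np.abs(s - r)))][:k]
        x[r] = np.rint(x[nn].mean())
    return x

x = np.array([2.0, np.nan, 4.0, np.nan, 6.0])
obs = ~np.isnan(x)
x_ker = kernel_impute(x, obs, h=1.0)
x_knn = knn_impute(x, obs, k=2)
```

On this toy series both sketches fill the gaps with the rounded local averages 3 and 5.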
Keywords/Search Tags: Integer-valued time series, missing data, periodic model, autoregressive process, thinning operator