
Statistical Inference And Variable Selection Based On Non-smooth Estimating Equations Under Complete And Missing Data

Posted on: 2011-08-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Liu
GTID: 1100330332984380
Subject: Probability theory and mathematical statistics
Abstract/Summary:
This thesis studies the estimation of distributions and parameters, and variable selection, when auxiliary information is available, under both complete and missing data. In statistical analysis, the available information is often not fully used. For example, in least-squares estimation we may also know that the error distribution is symmetric about 0, or that the error variance is a function of its mean. Such auxiliary information is commonly expressed as unbiased estimating equations (EEs). These are sometimes smooth, as when the error variance is a function of its mean, and sometimes non-smooth, as in estimation of the median, sample quantiles, and quantile regression. Auxiliary information arises not only with complete data but also with missing and censored data.

Because there is often a great deal of auxiliary information, it cannot be combined arbitrarily, and it may go underused for lack of suitable methods. This thesis first constructs EEs and then uses estimating-equation methods to assign different weights to different pieces of auxiliary information in order to improve estimation efficiency. Existing work shows that such weighted distribution estimators are considerably more efficient than the empirical distribution estimator, which uses uniform weights. This thesis therefore uses EEs to construct weighted estimators of distributions. An EE is built from a function of the parameters and the variables whose expectation is zero, for example E{ψ(Y, X, θ)} = 0; that is, the estimating function ψ(Y, X, θ) is unbiased. For such EEs, the number of equations usually exceeds the number of parameters, a situation called an over-determined system. Over-determined systems arise frequently in finance, economics, and biological studies.
Therefore, both practically and theoretically, it is important to develop methods for EEs.

There is a large literature on statistical inference for smooth EEs with different kinds of data. In practice, however, many estimating functions are not smooth, for example in median regression, when symmetry of the distribution about its mean is treated as auxiliary information. An EE constructed in this way is not smooth with respect to the parameters, and existing methods are not valid for such non-differentiable EEs. This thesis employs a kernel smoothing technique to smooth these non-differentiable EEs and, based on the smoothed EEs, studies estimators of the parameters and of the distribution of the response variable, together with their asymptotic properties and, through simulation, their small-sample properties. Under missing data, we also consider how to construct asymptotically unbiased EEs from non-differentiable EEs. The estimators of the parameters and of the distribution of the response variable based on empirical likelihood, and their asymptotic properties, are also studied in detail.

Model selection is one of the most important tasks in statistics. Variable selection is fundamental to high-dimensional statistical modeling and is an important issue in applied econometric analysis and statistical inference. With the rapid development of science and technology, the dimension of data keeps growing, so efficient methods for handling high-dimensional data are needed, and variable selection has become a frequent topic of study. This thesis studies variable selection in the presence of auxiliary information using the SCAD penalty, and proposes penalized empirical likelihood estimators and penalized generalized method of moments (GMM) estimators. Consistency and the oracle property are established for both estimators.
Computations with the MM algorithm show that taking auxiliary information into account greatly improves the rate of correct variable selection.

The case-control study is an important method for studying factors related to disease incidence, and case-control studies are common in biostatistics. Most of the literature considers the linear logistic regression model for case-control studies. This thesis generalizes the linear logistic regression model to a varying-coefficient logistic regression model for case-control studies. The varying-coefficient model overcomes some drawbacks of linear models, which may not be flexible enough and may produce modeling bias when the parametric assumptions are misspecified. It also avoids drawbacks of purely nonparametric models, whose estimated curves are hard to interpret and which suffer from the curse of dimensionality with high-dimensional data. The characteristics of the case-control design yield auxiliary information, which we express as EEs. This thesis proposes a nonparametric local empirical likelihood estimator for the varying-coefficient logistic model that takes this auxiliary information into account. Under some regularity assumptions, the estimator is shown to be consistent and asymptotically normally distributed.

The thesis is divided into six chapters. Chapter One introduces the background on estimating distributions with different kinds of data and on estimating equations. It also reviews the state of research on missing data, variable selection, and case-control studies.

Chapter Two studies the estimator of the distribution function based on smooth estimating equations under missing data. There is a large literature on statistical inference for distributions under missing data, but, surprisingly, very little attention has previously been paid to missing data in the context of estimating a distribution with auxiliary information.
In this chapter, auxiliary information under missing data is formulated. We use a kernel-assisted estimating-equation imputation scheme to mitigate the effects of missing data through a reformulation of the estimating equations. In this way, we can estimate the distribution and its τ-th quantile while taking auxiliary information into account. Asymptotic properties of the distribution estimator and of the corresponding sample quantiles are derived and analyzed. The distribution estimators based on our method are found to perform significantly better than the corresponding estimators that do not use the auxiliary information. Simulation studies illustrate the finite-sample performance of the proposed estimators.

Chapter Three considers the estimator of the distribution function of the response variable based on non-differentiable estimating equations. The main idea is to combine least-squares and quantile regression to improve the efficiency of the distribution function estimator. We propose an estimator of the distribution of a variable with non-smooth auxiliary information. A smoothing technique is employed to handle the non-differentiable function, so that the distribution can be estimated from the smoothed auxiliary information. The distribution estimators based on our method are found to perform significantly better than the corresponding estimators that do not use the auxiliary information. Simulation studies illustrate the finite-sample performance of the proposed estimators.

In Chapter Four, we combine least-squares and quantile regression to improve the efficiency of the distribution function estimator based on non-differentiable estimating equations under missing data. The smoothing technique proposed in Chapter Three and the kernel-assisted estimating-equation imputation scheme of Chapter Two are used to construct asymptotically unbiased estimating equations.
The proposed maximum smoothed empirical likelihood estimators of the unknown parameters remain consistent and asymptotically normal. The asymptotic properties of the distribution estimator of the response variable are also analyzed. Simulation studies illustrate the finite-sample performance of the proposed estimators.

Chapter Five studies the penalized generalized method of moments and penalized empirical likelihood with the SCAD penalty, based on estimation with non-differentiable auxiliary information. We smooth these estimating equations using the smoothing technique from the earlier chapters. The proposed minimum penalized GMM estimators and penalized empirical likelihood estimators of the unknown parameters, based on smoothed EEs (SEEs), retain consistency, asymptotic normality, and the oracle property. Simulation studies illustrate the finite-sample performance of the proposed penalized GMM estimators, which perform better than the penalized least-squares estimator.

In Chapter Six, we develop a varying-coefficient logistic regression model for case-control studies. We propose a nonparametric estimator of the coefficient function of the varying-coefficient logistic model based on local empirical likelihood. Under some regularity assumptions, the estimator is shown to be consistent and asymptotically normally distributed.
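
The idea of smoothing a non-differentiable estimating equation can be illustrated with a small sketch. It is not the thesis's implementation, only one common way to do it under stated assumptions. The τ-th quantile θ solves E{τ − 1(Y ≤ θ)} = 0, and the indicator makes this equation non-smooth in θ. The sketch replaces the indicator with G((θ − Y)/h), where G is the standard normal distribution function (the integral of a Gaussian kernel) and h is a bandwidth. The choice of G, the bandwidth value, the bisection solver, and all function names are assumptions made for illustration.

```python
import math


def normal_cdf(u):
    """Standard normal CDF, used here as the smooth stand-in for 1(u >= 0)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def smoothed_quantile_ee(theta, ys, tau, h):
    """Sample mean of tau - G((theta - y) / h).

    This is a smooth replacement for the non-differentiable sample
    estimating function mean(tau - 1(y <= theta)). The names and the
    Gaussian smoother are illustrative assumptions.
    """
    return sum(tau - normal_cdf((theta - y) / h) for y in ys) / len(ys)


def solve_smoothed_quantile(ys, tau=0.5, h=0.1, tol=1e-10):
    """Solve the smoothed estimating equation for theta by bisection.

    The smoothed function is strictly decreasing in theta. For tau in
    (0, 1) it is positive well below min(ys) and negative well above
    max(ys), so the root is bracketed by the interval below.
    """
    lo = min(ys) - 10.0 * h
    hi = max(ys) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if smoothed_quantile_ee(mid, ys, tau, h) > 0.0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Symmetric toy sample: by symmetry the smoothed median equation
    # has its root at the centre of the sample.
    sample = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(solve_smoothed_quantile(sample, tau=0.5, h=0.1))
```

Because the smoothed function is differentiable in θ, derivative-based tools such as Newton steps and standard asymptotic expansions become available. Bisection is used here only to keep the sketch short and dependency-free.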
Keywords/Search Tags: Estimating equations, Unbiased estimating function, Missing data, Auxiliary information, Varying-coefficient logistic regression, Empirical likelihood, GMM, Variable selection, Case-control study