
Some Study On Robust Statistical Inference And High-dimensional Data

Posted on: 2016-02-28  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Feng  Full Text: PDF
GTID: 1220330503950910  Subject: Probability theory and mathematical statistics
Abstract/Summary:
Over the last two decades, nonparametric modeling techniques have developed rapidly because they reduce the modeling bias of traditional parametric methods. Local polynomial regression (Fan and Gijbels 1996) is widely used for nonparametric regression. However, the efficiency of least-squares-based methods is adversely affected by outlying observations and heavy-tailed distributions. Thus, effort should be devoted to constructing robust nonparametric polynomial smoothers and corresponding test procedures. The robust approach to statistical modeling and data analysis aims at deriving efficient methods that produce reliable estimators and associated tests for a wide spectrum of distributions, including when outliers are present. Traditional robust methods, including M-estimates, L-estimates and R-estimates, have been widely used in statistical problems and shown to possess many good properties. In the first part of this dissertation, we extend the traditional R-estimates to nonparametric models.

In the past decades, high-dimensional data have been generated in many areas, such as hyperspectral imagery, internet portals, microarray analysis and DNA. One important feature of high-dimensional data is that the dimension p often greatly exceeds the sample size n, which brings great challenges to many traditional statistical methods and theories. This large-p-small-n paradigm is translated to a regime of asymptotics where p increases to infinity as the number of observations $n \to \infty$, particularly $p \gg n$. Traditional test procedures, such as Hotelling's $T^2$ test, cannot work in high-dimensional settings because the sample covariance matrix is not invertible. Furthermore, there would be a non-negligible bias term in the traditional test statistics when the dimension p is ultra-high because the sample estimators are only root-n consistent. Thus, some novel test procedures should be constructed to overcome this challenge.
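The non-invertibility of the sample covariance matrix when p exceeds n can be seen in a small numerical sketch. The sizes `n` and `p`, the seed and the Gaussian data below are assumptions chosen only for illustration; they do not come from the dissertation.

```python
import numpy as np

# Hypothetical sizes for illustration: the dimension p exceeds the sample size n.
n, p = 20, 50
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))      # n observations of a p-dimensional vector

S = np.cov(X, rowvar=False)          # p x p sample covariance matrix
rank_S = np.linalg.matrix_rank(S)    # centering leaves at most n - 1 independent rows

print(f"p = {p}, rank of S = {rank_S}")
# S is rank-deficient, so the inverse required by Hotelling's T^2 does not exist.
```

Because the rank of the sample covariance matrix is bounded by n - 1, no amount of data cleaning restores invertibility once p is at least n, which is why the statistic itself has to be redesigned.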
In the second part of this dissertation, we aim to address the high-dimensional test problems.

In the first part of this dissertation, we aim to address the robust statistical inference of nonparametric models. We focus on the following topics: estimation, variable selection and testing for some nonparametric models. We introduce and develop some new robust methodologies, with the help of rank-based procedures, to tackle these important but quite challenging problems. Next we give a brief introduction.

The weighted rank-based $L_1$ norm is often used in the development of robust R-estimates (Hettmansperger and McKean 1998). That is, $\|\varepsilon\|_W = \frac{\sqrt{12}}{n+1} \sum_{i=1}^{n} r_i |\varepsilon_i|$, where $r_i$ denotes the rank of $|\varepsilon_i|$ among $|\varepsilon_1|, \ldots, |\varepsilon_n|$. It is equivalent to

$$\frac{\sqrt{12}}{n+1} \sum\sum_{i<j} \left|\frac{\varepsilon_i - \varepsilon_j}{2}\right| + \frac{\sqrt{12}}{n+1} \sum\sum_{i \le j} \left|\frac{\varepsilon_i + \varepsilon_j}{2}\right| \equiv W_n(\varepsilon) + R_n(\varepsilon).$$

However, the first part $W_n$ is not applicable to the estimation of nonparametric regression functions because the intercept term does not affect it. Fortunately, if we additionally assume that the errors have a symmetric distribution, using $R_n$ instead yields a quite robust and efficient estimator. In Section 1, we propose a novel method, termed the local Walsh-average regression (LWAR) estimator, which minimizes a locally weighted Walsh-average-based loss function $R_n$. Theoretical studies show that the proposed estimator is asymptotically normal and highly efficient across a wide spectrum of distributions. Even though $W_n$ cannot be used to estimate the intercept, it can still estimate the regression coefficients in nonparametric models. In Section 2, we develop a robust estimator of the coefficients in single-index models, termed the rank-based OPG (ROPG) estimator, which combines the ideas of rank-based regression inference $W_n$ and the outer product of gradients (Xia 2006). In Section 3, we extend this rank-based method to the varying coefficient models.
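The decomposition of the weighted rank-based norm into $W_n$ and $R_n$ can be checked numerically. The sketch below assumes the normalizing constant is $\sqrt{12}/(n+1)$ and that the pairwise terms are absolute values of half-differences and half-sums. The function names `signed_rank_norm`, `w_n` and `r_n` are hypothetical and used only here. The second check illustrates why $W_n$ cannot identify an intercept: it is unchanged when every residual is shifted by the same constant.

```python
import numpy as np

def signed_rank_norm(e):
    """sqrt(12)/(n+1) * sum_i r_i |e_i|, with r_i the rank of |e_i| among the |e_k|."""
    e = np.asarray(e, dtype=float)
    abs_e = np.abs(e)
    ranks = np.argsort(np.argsort(abs_e)) + 1   # ranks 1..n
    return np.sqrt(12) / (e.size + 1) * np.sum(ranks * abs_e)

def w_n(e):
    """Pairwise-difference part: sum over i < j of |e_i - e_j| / 2."""
    e = np.asarray(e, dtype=float)
    i, j = np.triu_indices(e.size, k=1)
    return np.sqrt(12) / (e.size + 1) * np.sum(np.abs(e[i] - e[j]) / 2)

def r_n(e):
    """Walsh-average part: sum over i <= j of |e_i + e_j| / 2."""
    e = np.asarray(e, dtype=float)
    i, j = np.triu_indices(e.size, k=0)
    return np.sqrt(12) / (e.size + 1) * np.sum(np.abs(e[i] + e[j]) / 2)

rng = np.random.default_rng(0)
eps = rng.standard_t(df=3, size=50)             # heavy-tailed residuals (illustrative)

# The norm equals W_n + R_n.
assert np.isclose(signed_rank_norm(eps), w_n(eps) + r_n(eps))

# W_n ignores a common shift, so it carries no information about an intercept.
assert np.isclose(w_n(eps + 5.0), w_n(eps))
```

The identity follows from $\max(|a|, |b|) = (|a+b| + |a-b|)/2$: summing the maximum over all pairs $i \le j$ counts each $|\varepsilon_i|$ exactly $r_i$ times.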
We start by developing a novel robust estimator for these models, termed the rank-based spline estimator, which combines the ideas of rank inference $W_n$ and polynomial splines. Furthermore, we propose a robust variable selection method, RSSCAD, which incorporates the smoothly clipped absolute deviation penalty (Fan and Li 2001) into the rank-based spline loss function. Theoretical analysis reveals that RSSCAD is consistent in variable selection; that is, the probability that it correctly selects the true model tends to one. We also show that our procedure has the so-called oracle property; that is, the asymptotic distribution of an estimated coefficient function is the same as it would be if the variables in the model were known a priori. The asymptotic relative efficiency of all these methods with respect to the least-squares-based method is closely related to that of the signed-rank Wilcoxon test in comparison with the t-test. Both asymptotic and numerical results show that the proposed procedures perform better than the least-squares-based method when the errors deviate from normality.

An important inference question in nonparametric modeling is whether a parametric family adequately fits a given data set. Fan et al. (2001) proposed a generally applicable method, termed the generalized likelihood ratio (GLR) statistic, for testing nonparametric hypotheses about nonparametric functions. However, the efficiency of this method also suffers from the same problems as local polynomial regression. To address this challenge, in Section 1, a robust testing procedure, termed the Wilcoxon-type generalized likelihood ratio (WGLR) statistic, is developed under the framework of the GLR by incorporating a Wilcoxon-type artificial likelihood function $R_n(\varepsilon)$ and adopting the associated local smoothers.
Under some useful hypotheses, the Wilks phenomenon still holds for our WGLR; that is, the proposed test statistic is proved to be asymptotically normal and free of nuisance parameters and covariate designs. Its asymptotic relative efficiency with respect to the least-squares-based GLR method is also closely related to that of the signed-rank Wilcoxon test in comparison with the t-test. It outperforms the least-squares-based GLR with heavier-tailed data in the sense that asymptotically it can yield substantially larger power. The comparison of several regression curves is also an important problem of statistical inference. In Section 2, we extend the WGLR testing procedure to the comparison of two regression curves. Theoretical studies show that the Wilks phenomenon still holds in this problem. Simulation studies further support the theoretical results and show that our procedure outperforms other methods in the literature in most cases.

In the second part of this dissertation, we aim to address the high-dimensional test problems. We focus on the following topics: one-sample and two-sample location tests, and global tests of regression coefficients in high-dimensional linear models. We introduce and develop some new test procedures to tackle these problems. Next we give a brief introduction to the second part.

One-sample and two-sample location problems are classic problems in statistics. However, in high-dimensional settings, the classic Hotelling's $T^2$ test statistics do not work well. A natural idea is to use the Euclidean norm to replace the Mahalanobis norm in Hotelling's $T^2$ test statistics (Bai and Saranadasa 1996; Chen and Qin 2010). However, those methods have three drawbacks.

First, those tests are not scale-invariant. Srivastava, Katayama and Kano (2013) proposed a scale-invariant test based on the sum of the p squared univariate Fisher's test statistics.
However, to derive a well-defined asymptotic null distribution, the dimension p must be of smaller order than $n^2$; otherwise, their test would not have a well-defined limit because of a non-negligible bias term. In Section 1, we develop a novel scalar-transformation-invariant test based on a leave-one-out method for the high-dimensional Behrens-Fisher problem, which is able to integrate all the individual information in a relatively "fair" way. The dimensionality is now allowed to grow at a rate ranging from the square to the cube of the sample size in different scenarios. Simulation studies are conducted to compare the newly proposed procedure with other existing testing procedures and show that our procedure generally has more robust sizes and powers.

Second, those methods are all based on the assumption of a multivariate normal distribution or a diverging factor model, and are neither robust nor efficient for heavy-tailed distributions. This motivates us to consider using multivariate sign- and/or rank-based approaches (Oja 2010) to construct robust tests. In Section 2, we propose a multivariate-sign-based high-dimensional test for the two-sample location problem. We show that the proposed test statistic is asymptotically normal under elliptical distributions. By using only the direction of an observation from the origin but not its distance from the origin, our proposed test is more robust, to a certain degree, for the heavy-tailed distributions considered. Simulations also show that these multivariate-sign-based procedures are quite robust and efficient, especially for heavy-tailed or skewed distributions.

Finally, those methods abandon all the correlation information and are not effective. In Section 3, we propose a composite $T^2$ test to overcome this issue. The first step is to sequentially select K variables which have the largest correlation among all combinations of K elements from the remaining variables.
The second step is to construct p/K $T^2$ test statistics and combine them. Under mild conditions, the proposed test statistic is asymptotically normal, and allows the dimensionality to increase almost exponentially in n. This test inherits certain appealing features of the classical $T^2$ test and does not suffer from large bias contamination. Because it incorporates much of the correlation information, the proposed test can deliver more robust performance than existing methods in many cases.

In genomic studies, it is important to identify significant sets of genes which are associated with certain clinical outcomes. Recently, many efforts have been devoted to solving this problem by variable selection and screening procedures. In hypothesis testing, it can be advantageous to look for influence not at the level of individual variables but rather at the level of clusters of variables. Thus, in Section 1, we are concerned with simultaneous tests on linear regression coefficients in high-dimensional settings. When the dimensionality is larger than the sample size, the classic F-test is not applicable since the sample covariance matrix is not invertible. Recently, some testing procedures have been proposed that exclude the inverse term in F-statistics. However, the efficiency of such F-statistic-based methods is adversely affected by outlying observations and heavy-tailed distributions. We propose a robust score test based on rank regression $W_n(\varepsilon)$. The asymptotic distributions of the proposed test statistic under the high-dimensional null and alternative hypotheses are established. Theoretical and simulation studies show that our procedure performs better than the other two methods in the literature when the errors deviate from normality.

Chapter 5 consists of conclusions and related future work.

There are three innovation points in this dissertation. First, we extend the classic rank-based procedures to the nonparametric models and high-dimensional linear regression models.
We propose the local Walsh-average regression estimators for nonparametric model estimation and the corresponding Wilcoxon-type generalized likelihood ratio test statistics for nonparametric model checking. A rank gradient score test is proposed for the high-dimensional regression coefficients. Second, we propose a robust multivariate-sign-based high-dimensional test for the two-sample location problem, which is scale-invariant and efficient over a wide range of distributions. Finally, we propose a composite $T^2$ test statistic for the high-dimensional location problem, which makes the fullest use of the correlation information between the variables.
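The robustness claimed for the multivariate-sign-based test rests on using only the direction of each observation. The sketch below shows only the spatial-sign transformation $U(x) = x / \|x\|$, not the dissertation's test statistic. The function name `spatial_sign`, the data sizes and the Cauchy-distributed radii are assumptions made for illustration.

```python
import numpy as np

def spatial_sign(X):
    """Map each row x of X to x / ||x||; rows that are exactly zero stay zero."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 100))                 # 5 observations in dimension 100

# Multiply each observation by a heavy-tailed positive radius (illustrative choice).
radii = np.abs(rng.standard_cauchy(size=(5, 1))) + 1e-12
heavy = radii * Z

# The directions are unchanged, so extreme radii cannot dominate a sign-based statistic.
assert np.allclose(spatial_sign(Z), spatial_sign(heavy))
assert np.allclose(np.linalg.norm(spatial_sign(heavy), axis=1), 1.0)
```

Since every transformed observation has unit length, a single extreme observation contributes no more than any other, which is the sense in which distance from the origin is ignored.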
Keywords/Search Tags:BIC, Bootstrap, Generalized likelihood ratio, High Dimensional Data, Local polynomial regression, Rank regression, Spatial-sign