
Feature Selection And Bias-reduced Consistent Inference For Several High Dimensional Models

Posted on: 2019-03-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Lu
GTID: 1367330572956653
Subject: Statistics
Abstract/Summary:
With the rapid development of science and technology, high dimensional data appear more and more frequently in fields such as biomedical imaging, X-ray tomography, genotype-phenotype analysis, finance, and earth science. As the name suggests, the defining feature of high dimensional data is that the dimensionality is usually larger than the sample size; for ultrahigh dimensional data in particular, the dimension may grow exponentially with the sample size. In this setting many classical statistical methods no longer work. For example, when the number of predictors exceeds the sample size, the Gram matrix is singular, so the least squares estimate is not well defined. To address this problem, statisticians have developed two popular techniques, feature screening and variable selection, which together have become indispensable tools for handling high dimensional data and have been among the most active research areas of the past decade. This thesis also works in this field: we propose several new feature screening methods for different ultrahigh dimensional models and establish a new bias-correction estimation procedure for the high dimensional partial linear single index model.

1. We propose a new feature screening method for the multi-response varying coefficient linear model (MVCLM). The concept of sure independence screening (SIS) was first proposed by Fan and Lv (2008), who made clear the importance of SIS in dealing with ultrahigh dimensional data. Since that fundamental work, a great deal of literature, for example Fan and Song (2010), has extended the SIS method to various statistical models. This work is the first to generalize SIS to the MVCLM, an important class of semiparametric models whose simple structure and flexibility make it well suited to describing the dynamic relationship between the response and the covariates. Note that the MVCLM degenerates to the common linear model
given the varying coefficient, so we define a new conditional canonical correlation (CCC) to capture the dynamic relationship between each predictor and the multivariate response, treating the varying coefficient as the conditioning variable. We then take the expectation of the CCC over the varying coefficient and use it as the screening index to rank the predictors. In practice, we estimate the conditional means in the screening index by the Nadaraya-Watson (NW) method, with the bandwidth chosen by the AICc criterion of Hurvich et al. (1998). We prove the sure screening property of the new method, that is, the method selects all the active variables with probability approaching 1. Additionally, we put forward an iterative version of the method to deal with strong correlation among predictors. Simulations and a real data analysis confirm the good performance of the new method.

2. Most existing screening methods are based on the idea of marginal utility, so these unconditional screening methods can miss important variables when the predictors are highly correlated. Although many statisticians have proposed iterative methods to repair this defect, the iterative methods can still fail in some cases and, more importantly, their sure screening property remains in question. In practice, researchers often know several important predictors in advance from previous research and experience. Using such known predictors as prior information, Barut et al. (2016) first proposed the concept of conditional sure independence screening in the context of generalized linear models. Hu and Lin (2017) and Lin and Sun (2016) subsequently made further progress on this issue. However, all these existing conditional methods need to specify the model structure; once model
misspecification occurs, these methods break down easily. Motivated by these observations, we propose a completely model-free conditional screening procedure. To this end, we employ the conditional distance correlation (CDC) to describe the nonlinear dependence between each predictor and the response, with the predictors known in advance serving as the conditioning variables. We then integrate the CDC over these conditioning variables and use the integral as the screening index. Thanks to the properties of the CDC, we do not need to specify any regression relationship between the response and the predictors. In practice, we again use the NW method to compute the sample version of the CDC. To avoid the possible "dimensionality dilemma" caused by a high dimensional conditioning variable, we borrow the idea of Lavergne and Patilea (2008) of transforming the integral over a multidimensional space into an integral over the unit sphere; Lavergne and Patilea (2012) also provide a simple route for computing the multiple integral on the unit sphere. Under some regularity conditions, we prove the sure screening property of the new method. Numerical studies demonstrate that our method is not only completely model free but also able to overcome the negative effects of strong correlation among predictors. A real data analysis further illustrates the effectiveness of the new method.

3. Data with ultrahigh dimensional predictors and a discrete response are often faced by biostatistics practitioners working on multi-class classification problems. For example, in cancer diagnosis researchers usually need to build a classification model that identifies the type of cancer from genetic data. It is unwise to use all the predictors when building such a model, because most of them are actually noise. We propose a new feature screening method based on conditional characteristic functions. Note
that if a predictor has no effect on predicting the response, then its characteristic function equals its conditional characteristic function given the response. Based on this fact, we define a weighted Euclidean distance to measure the difference between the characteristic function of the predictor and its conditional counterpart. By choosing a special weight function, we prove that the proposed distance is essentially a sum of two second moments, which makes the screening procedure very convenient to carry out. It is worth mentioning that the proposed distance is actually a transformed version of the distance correlation. Under some regularity conditions, we prove the sure screening property of the new method. Numerical simulations and real data analyses illustrate the superior performance of our method over the existing competitors.

4. In the study of high dimensional data, we observe some interesting phenomena. First, as a "fast but dirty" method, feature screening can easily miss some important variables. Second, further variable selection, for example by the classical stepwise variable selection method (Efroymson, 1960), can still miss important variables. When either of these situations occurs, the model is likely to be biased. Moreover, both variable selection and feature screening rely on the sparsity assumption, which, as Donoho and Jin (2015) state, is violated on many occasions. Thus, when the model is non-sparse, both feature screening and variable selection yield a biased model. We study a bias-correction method for the partial linear single index model (PLSIM). First, under the linearity condition and smoothness of the single index regression function, we prove that the PLSIM is equivalent to a pro forma linear model, which reduces the complexity of further statistical inference. Once the working model, which might be biased, is determined, we
employ the spectral decomposition to extract the bias information from the variables outside the model and reformulate the bias as an artificial variable. We then add the artificial variable to the biased working model so that the working model is no longer biased. Finally, the model parameters can be estimated consistently by the least squares method. We also propose a new consistent method for estimating the dimension of the artificial variable. Numerical simulations and real data analyses show that the new method corrects the bias very well.
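
As a rough illustration of the marginal-utility screening idea that items 1 to 3 build on, the sketch below ranks predictors by absolute marginal Pearson correlation with a scalar response, in the spirit of the linear-model SIS of Fan and Lv (2008). It is not the CCC, CDC, or characteristic-function index proposed in the thesis. The function name `marginal_screen`, the simulated data, and the model size d = n / log(n) are assumptions made only for this example.

```python
import numpy as np

def marginal_screen(X, y, d):
    """Keep the d predictors with the largest absolute marginal correlation with y.

    Hypothetical helper for illustration; the utility is plain Pearson
    correlation, not one of the screening indices defined in the thesis.
    """
    Xc = X - X.mean(axis=0)          # center each predictor column
    yc = y - y.mean()                # center the response
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom = np.where(denom == 0, np.inf, denom)   # constant columns get utility 0
    utility = np.abs(Xc.T @ yc) / denom           # |corr(X_j, y)| for every j
    top = np.argsort(-utility)[:d]                # indices of the d largest utilities
    return np.sort(top), utility

# Simulated ultrahigh dimensional setting (assumed for the example): p >> n.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
active = np.array([0, 1, 2])         # only three predictors truly matter
beta[active] = 3.0
y = X @ beta + rng.standard_normal(n)

d = int(n / np.log(n))               # one commonly used screening size
selected, utility = marginal_screen(X, y, d)
print("kept", len(selected), "of", p, "predictors")
print("active predictors retained:", np.isin(active, selected))
```

Because each predictor is scored on its own, the cost is linear in the number of predictors, which is what makes screening usable when the dimension far exceeds the sample size. The same independence is also why a purely marginal score can miss a predictor that matters only jointly with others, the weakness that the iterative and conditional procedures in items 1 and 2 are meant to address.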
Keywords/Search Tags:feature screening, conditional distance correlation, conditional canonical correlation, varying coefficient models, discrete response, bias-correction estimation