
Variable Selection For High-dimensional Linear Models

Posted on: 2011-02-16  Degree: Master  Type: Thesis
Country: China  Candidate: L L Li  Full Text: PDF
GTID: 2120360305477933  Subject: Probability theory and mathematical statistics
Abstract/Summary:
High-dimensional data frequently appear in many areas such as bioinformatics, biomedicine, econometrics and machine learning, and they cause classical statistical methods to fail in most cases. High-dimensional inference is therefore one of the most difficult and challenging fields in statistical theory and applications. In both theoretical and applied studies of high-dimensional data, the sparsity condition is used frequently: the true model lies in a low-dimensional space in which the number of covariates is smaller than the sample size. If the sparsity condition is not satisfied, the true model cannot be identified and variable selection is meaningless. According to suitable criteria, we can carry out variable selection and obtain an approximation to the true model. In practice, we often delete the variables that are weakly correlated or uncorrelated with the response variable and keep the variables that are highly correlated with the response.

In this thesis, variable selection in the high-dimensional linear model is investigated. We mainly consider the case in which the error distribution is unknown and the dimension p is larger than the sample size n. Our method is a combination of the SIS (sure independence screening) or ISIS (iterative sure independence screening) of Fan and Lv (2008) with the AEL (adjusted empirical likelihood) of Chen, Variyath and Abraham (2008). The paper of Fan and Lv (2008), "Sure independence screening for ultrahigh dimensional feature space", was published in the Journal of the Royal Statistical Society, Series B (70:849-911); the other paper, "Adjusted empirical likelihood and its properties", was published in the Journal of Computational and Graphical Statistics (17:426-443).

Theoretically, we prove that the asymptotic properties of SIS and ISIS in Fan and Lv (2008) still hold without the Gaussian assumption on the error distribution. Concretely, under some conditions we have

P(M_* ⊂ M_γ) = 1 − O(exp(−C·n^(1−2κ)/log n)),

where M_* is the true sparse model, M_γ is the selected model containing the [nγ] variables most highly correlated with the response variable, γ ∈ (0, 1), and 1 − 2κ > 0. This property indicates that the selected model contains the true model with probability tending to one.

Algorithmically, we give the SIS+AEL and iterative SIS+AEL algorithms. The idea of the SIS+AEL algorithm is as follows: first, we choose the [nγ] variables most highly correlated with the response variable; next, we use the AEL to compute the corresponding AIC and BIC; finally, the approximating model is chosen by minimizing AIC or BIC. The iterative SIS+AEL algorithm proceeds as follows. In the first step, we use SIS+AEL to choose x_{i1}, ..., x_{im1} from x_1, ..., x_p and establish a linear model of y on x_{i1}, ..., x_{im1}. In the second step, we use SIS+AEL again to select variables, but instead of y and x_1, ..., x_p, the response variable is the residual y − (x_{i1}β̂_{i1} + ... + x_{im1}β̂_{im1}) and the covariates are the remaining p − m1 variables, excluding x_{i1}, ..., x_{im1}. We continue in this way until some stopping criterion is satisfied. The algorithms not only retain the asymptotic properties of SIS and ISIS but also weaken the assumption on the error distribution. The idea is straightforward, and the two components compensate for each other's deficiencies.

Finally, we report some simulations. The simulation results show that when the error distribution is Gaussian, the accuracy of ISIS+AEL in including the true model is close to that of the LASSO (a popular variable-selection method); when the error distribution is not Gaussian, the accuracy of ISIS+AEL in including the true model is better than that of the LASSO.

The main results of the thesis are as follows:

1. SIS, iterative SIS and the AEL are combined systematically. The new method lets the components compensate for each other's deficiencies, reduces the computational cost and widens the fields of application.

2. The restriction to a Gaussian error distribution is removed.
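The screening steps above can be illustrated with a minimal NumPy sketch. This is not the thesis code: the AEL-based AIC/BIC model choice is omitted (plain least squares stands in for it), and the function names, the choice d of retained variables per pass, and the toy data are assumptions of this sketch.

```python
import numpy as np

def sis_screen(X, y, d):
    """Sure independence screening: rank covariates by absolute
    marginal correlation with the response and keep the top d."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]

def iterative_sis(X, y, d, n_iter=2):
    """Iterative SIS: after each screening pass, regress y on all
    variables selected so far and re-screen the remaining covariates
    against the residual, as described in the abstract."""
    n, p = X.shape
    remaining = np.arange(p)
    selected = []
    resid = y.copy()
    for _ in range(n_iter):
        idx = sis_screen(X[:, remaining], resid, d)
        chosen = remaining[idx]
        selected.extend(chosen.tolist())
        # least-squares fit on the selected columns (AEL-based model
        # choice would replace this step in the thesis's algorithm)
        beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        resid = y - X[:, selected] @ beta
        remaining = np.setdiff1d(remaining, chosen)
    return selected

# toy example with p > n and a sparse true model on covariates 0, 1, 2
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 4 * X[:, 1] + 2 * X[:, 2] + rng.standard_normal(n)
picked = iterative_sis(X, y, d=5, n_iter=2)
print(sorted(picked))
```

With two passes of d = 5 the sketch keeps ten candidate variables; the strong signals are picked up by the marginal correlations, and the residual-based second pass gives weaker signals another chance, which is the point of the iterative scheme.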
Although the computation of SIS and ISIS in Fan and Lv (2008) is simple, their asymptotic properties need not hold without a Gaussian error distribution. Theoretically, we prove that the asymptotic properties of SIS and ISIS in Fan and Lv (2008) still hold under weaker conditions. Moreover, by applying the AEL method the dimension p can be reduced to m (m < n).

3. Using the AEL method to select variables, we overcome a shortcoming of empirical likelihood. It is well known that empirical likelihood has a precondition: in the estimating equation E_F g(y, θ) = 0, the estimator of θ exists if and only if the convex hull of {g(y_i, θ), i = 1, ..., n} contains zero as an interior point. To avoid the systematic bias caused by the failure of this precondition, we adopt the adjusted empirical likelihood proposed by Chen, Variyath and Abraham (2008).
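The convex-hull fix can be sketched in a few lines: the adjusted empirical likelihood appends the pseudo-observation g_{n+1} = −(a_n/n)·Σ g(y_i, θ), which points back through the origin, so zero always lies inside the convex hull of the adjusted set. The adjustment level a_n = log(n)/2 follows the recommendation attributed to Chen, Variyath and Abraham (2008); the function name and the toy mean-estimation example are assumptions of this sketch.

```python
import numpy as np

def adjusted_scores(g):
    """Append the AEL pseudo-observation g_{n+1} = -a_n * mean(g),
    with a_n = log(n)/2, so that 0 is an interior point of the
    convex hull of the adjusted estimating-function values."""
    g = np.atleast_2d(g)
    n = g.shape[0]
    a_n = np.log(n) / 2.0
    g_extra = -a_n * g.mean(axis=0)   # lies on the opposite side of 0
    return np.vstack([g, g_extra])

# toy example: mean estimating function g(y_i, theta) = y_i - theta.
# For theta = 5 outside the data range [1, 3], the ordinary EL convex
# hull {y_i - theta} = {-4, -3, -2} does not contain 0 ...
y = np.array([1.0, 2.0, 3.0])
theta = 5.0
g = (y - theta).reshape(-1, 1)

# ... but the adjusted set does, so the AEL is well defined at theta.
g_adj = adjusted_scores(g)
print(g_adj.ravel())
print(g_adj.min() < 0 < g_adj.max())   # True
```

Without the adjustment, the empirical likelihood is simply undefined at such θ, which is the "false precondition" bias the thesis avoids by switching to the AEL.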
Keywords/Search Tags: High-dimensional linear model, Variable selection, Sure independence screening, Iterative sure independence screening, Adjusted empirical likelihood