
Variable Selection For Some High-dimensional Models And Re-modeling For The Selected Models

Posted on: 2012-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Ge
GTID: 1480303353953819
Subject: Financial mathematics and financial engineering

Abstract/Summary:
Large data sets generated in financial markets, gene expression array analysis, combinatorial chemistry and other fields have attracted interest over the past three or four decades, as the rapid development of networks and computer storage capability has made it possible to collect, store and analyze them. These data sets usually have very high dimension (large p, small n). Using tens or hundreds of thousands of variables or features to build a model directly is usually cost-ineffective and yields poor prediction performance. Variable selection is the technique of selecting a subset of relevant features for building robust learning models.

This thesis mainly concerns two subjects: variable selection and model bias correction. On one side, we study the model selection consistency of the Dantzig selector, one of the popular variable selection methods, and then the large-sample properties of the adaptive Dantzig selector; both are studied in the high-dimensional linear model setting. On the other side, for a biased sub-model, we adjust the model by adding a nonparametric term and thereby correct the sub-model partially.

A great many methods have been proposed for the high-dimensional variable selection problem. The Dantzig selector, an effective variable selection method proposed by Candes and Tao (2007), has become very popular, but its large-sample properties had hardly been studied except by Dicker and Lin (2009), who showed that the Dantzig selector is model selection consistent under a random design matrix when the number of variables p is fixed. In Chapter 2, we obtain the model selection consistency of the Dantzig selector under fixed design for all scales of p.
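The Dantzig selector itself is the solution of min ||beta||_1 subject to ||X'(y − X beta)||_∞ ≤ λ, which is a linear program. The following is a minimal illustrative sketch (not the thesis's own code), splitting beta into positive and negative parts and solving with scipy's `linprog`; the toy data and the choice λ = σ·sqrt(2 n log p) are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve min ||beta||_1  s.t.  ||X'(y - X beta)||_inf <= lam
    as an LP in (u, v) with beta = u - v, u >= 0, v >= 0."""
    n, p = X.shape
    A = X.T @ X          # p x p Gram matrix
    b = X.T @ y          # p-vector of correlations with the response
    # |b - A(u - v)| <= lam  <=>  A(u-v) <= b + lam  and  -A(u-v) <= lam - b
    A_ub = np.block([[A, -A], [-A, A]])
    b_ub = np.concatenate([b + lam, lam - b])
    c = np.ones(2 * p)   # minimize sum(u) + sum(v) = ||beta||_1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

# Toy example (assumed): sparse truth with two nonzero coefficients
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[0], beta[3] = 2.0, -1.5
y = X @ beta + 0.1 * rng.standard_normal(n)
lam = 0.1 * np.sqrt(2 * n * np.log(p))   # sigma * sqrt(2 n log p)
beta_hat = dantzig_selector(X, y, lam)
```

With a strong signal and small noise, the recovered support matches the true model {1, 4}, illustrating the consistency studied in Chapter 2.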
Consider the linear regression model y = X beta + epsilon, where y = (Y1, Y2, …, Yn)' is an n×1 response, X = (X1, X2, …, Xn)' = (x1, …, xp) is an n×p fixed design matrix with Xi the ith row of X and xj the jth column of X, and epsilon = (e1, e2, …, en)' is an n-vector of i.i.d. random errors with E(e1) = 0 and E(e1^2) = sigma^2. Let T* = {j : beta_j ≠ 0}; we refer to T* as the true model. For a subset T of {1, 2, …, p}, |T| denotes the number of elements in T, T̄ denotes the complement of T in {1, 2, …, p}, and beta_T = (beta_j), j in T, is the |T|×1 vector whose entries are those of beta indexed by T. Denote C = X'X/n; for subsets T1, T2 of {1, 2, …, p}, let C_{T1,T2} be the |T1|×|T2| sub-matrix of C with rows corresponding to T1 and columns corresponding to T2. We then define the Irrepresentable Condition of the Dantzig selector under the fixed design case. Assume that for some E in {1, 2, …, p} with |E| = |T*|, the matrix C_{T*,E} is invertible; the Irrepresentable Condition of the Dantzig selector requires that there exists a positive constant vector eta such that

    | C_{T̄*,E} C_{T*,E}^{-1} sign(beta_{T*}) | ≤ 1 − eta,

where 1 is a (p−q)×1 vector with all components 1 and |·| means the inequality holds element-wise in absolute value. The Irrepresentable Condition means that insignificant predictors are irrepresentable by the significant ones, and it plays an important role in the consistency of the Dantzig selector. Under the Irrepresentable Condition, we show that the Dantzig selector can consistently select the true model both when p (the number of predictors) is fixed and when p diverges, even at an exponential rate of n, where consistency refers to sign consistency in probability, that is,

    P( sign(beta_hat(lambda)) = sign(beta) ) → 1 as n → ∞,

where beta_hat(lambda) is the solution of the Dantzig selector and lambda is the penalization parameter. We also investigate the conventional consistency of the estimator after variable selection, and obtain consistency only if the significant variable size q = o(n).
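To make the condition concrete, here is a hedged numerical sketch that checks the element-wise inequality for a given design and sign pattern. For simplicity it takes E = T* (one admissible choice, since |E| = |T*| is required); the two example designs are assumptions for illustration:

```python
import numpy as np

def irrepresentable_gap(X, support, sign_beta):
    """Return max_j | C_{T̄,E} C_{T,E}^{-1} sign(beta_T) |_j with E = T*.
    The Irrepresentable Condition holds iff this value is strictly below 1."""
    n, p = X.shape
    C = X.T @ X / n
    T = np.asarray(support)
    Tc = np.setdiff1d(np.arange(p), T)
    M = C[np.ix_(Tc, T)] @ np.linalg.inv(C[np.ix_(T, T)])
    return float(np.max(np.abs(M @ sign_beta)))

rng = np.random.default_rng(1)
n = 500
z = rng.standard_normal((n, 3))
# Nearly orthogonal design: the condition is easily satisfied
X_good = z
# Third column almost a sum of the two significant predictors: violated
X_bad = np.column_stack([z[:, 0], z[:, 1],
                         0.8 * z[:, 0] + 0.8 * z[:, 1] + 0.1 * z[:, 2]])
gap_good = irrepresentable_gap(X_good, [0, 1], np.array([1.0, 1.0]))
gap_bad = irrepresentable_gap(X_bad, [0, 1], np.array([1.0, 1.0]))
```

The second design shows the intuition in the text: an insignificant predictor that is well "represented" by the significant ones pushes the gap above 1, and sign consistency can then fail.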
The Dantzig selector, as shown in Chapter 2, can be consistent if the underlying model satisfies the Irrepresentable Condition; yet when the Irrepresentable Condition is not satisfied, model selection consistency no longer holds. Besides, the Dantzig estimator cannot attain the oracle property in the sense of Fan and Li (2001) and Fan and Peng (2004). However, this asymptotic setup is somewhat unfair, because it forces all coefficients to be equally penalized. So in Chapter 3 we assign different weights to different coefficients and obtain a weighted Dantzig selector, the so-called adaptive Dantzig selector, and study its asymptotic properties in sparse high-dimensional linear regression models. If a reasonable initial estimator is available, we show that the adaptive Dantzig selector has the oracle property under appropriate conditions, whether p grows at a polynomial or an exponential rate with n, without the constraint of the Irrepresentable Condition; that is, the adaptive Dantzig selector beta_hat(ADS) satisfies P( sign(beta_hat(ADS)) = sign(beta) ) → 1, and the estimator of the nonzero coefficients is asymptotically normal. Available initial estimators, which supply the weights of the adaptive Dantzig selector, are also provided for both p ≤ n and p > n at the end of Chapter 3.

In applications, important variables are usually selected according to practical experience. For example, in medicine, the search for pathogenic genes related to some cancer depends on clinical trials, which usually cannot identify all the pathogenic genes. Besides, in some cases, even if we use a variable selection method that is model selection consistent, such as the Dantzig selector, one cannot always be sure of selecting the true sub-model. Thus the sub-model used in applications is usually biased. If we simply use the biased sub-model for forecasting or control, it will give poor guidance.
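Chapter 3's adaptive weighting can be sketched as follows. This is a simplified, hedged variant (not the thesis's exact formulation): the weights w_j = 1/|beta_init_j| come from an OLS initial estimator (valid here since p < n) and enter only the ℓ1 objective, while the constraint is the unweighted Dantzig constraint:

```python
import numpy as np
from scipy.optimize import linprog

def adaptive_dantzig(X, y, lam, weights):
    """min sum_j w_j |beta_j|  s.t.  ||X'(y - X beta)||_inf <= lam,
    as an LP in (u, v) with beta = u - v, u >= 0, v >= 0."""
    n, p = X.shape
    A, b = X.T @ X, X.T @ y
    A_ub = np.block([[A, -A], [-A, A]])
    b_ub = np.concatenate([b + lam, lam - b])
    c = np.concatenate([weights, weights])   # weighted l1 objective
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[0], beta[3] = 2.0, -1.5
y = X @ beta + 0.1 * rng.standard_normal(n)
# Initial estimator (OLS, since p < n) supplies data-driven weights:
# small initial coefficients get large weights and are penalized harder
beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
w = 1.0 / (np.abs(beta_init) + 1e-8)
lam = 0.1 * np.sqrt(2 * n * np.log(p))
beta_ads = adaptive_dantzig(X, y, lam, w)
```

The data-driven weights shrink the truly zero coefficients toward exact zero while leaving the large coefficients nearly unpenalized, which is the mechanism behind the oracle property.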
So it is necessary and meaningful to deal with the biased sub-model and try to correct, or at least reduce, its bias. In Chapter 4 we re-model the sub-models so that the final models are identifiable and unbiased. Instead of the linear model, here we study a wider class of models, the partially linear model, defined as

    Yi = Xi' alpha + Zi' beta + g(Ti) + ei,  i = 1, …, n,

where the Yi are i.i.d. observations of the response variable Y, (Ti, Xi', Zi') are observations of the associated covariates (T, X', Z'), alpha = (alpha_1, …, alpha_p)' is a p-dimensional vector of unknown parameters, beta = (beta_1, …, beta_q)' is a q-dimensional vector of unknown parameters, and g(·) is an unknown function. To avoid the curse of dimensionality we assume, for simplicity, that T is univariate, and the ei are i.i.d. errors with E(e1) = 0 and E(e1^2) = sigma^2. Here the dimension q of beta may be very high and may even tend to infinity as the sample size increases. We suppose that Z is relatively insignificant and is thus removed from the full model above, and write the sub-model as

    Yi = Xi' alpha + g(Ti) + ei.

Such a model is biased, because the components of beta are merely relatively small, not zero. To deal with this problem, a nonparametric adjustment procedure is provided to construct a partially unbiased sub-model. The adjusted sub-model is constructed by adding a nonparametric term in the removed covariates,

    Yi = Xi' alpha + g(Ti) + h(Zi' c) + ei,

where c is some given vector and h(·) is unknown. "Partially unbiased" here means that we can construct sample subspaces on which both the adjusted restricted-model estimator and the adjusted preliminary test estimator of the adjusted sub-model are consistent: when the samples fall into the given subspaces, the estimators are consistent. Although we correct the model bias only partially, such subspaces are, fortunately, large enough in a certain sense, so this partial consistency is close to global consistency. Simulations and real data analysis are also provided to illustrate the new methods.
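The sub-model Yi = Xi' alpha + g(Ti) + ei can be estimated by partialling out the nonparametric part: smooth both Y and X on T and regress the residuals. This is a Speckman-type sketch under assumed simulation settings (Gaussian kernel, bandwidth h = 0.05), not the thesis's exact adjustment procedure:

```python
import numpy as np

def nw_smoother(t, h):
    """Nadaraya-Watson smoothing matrix S with a Gaussian kernel:
    (S m)_i estimates E[m | T = t_i]."""
    K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

# Simulated partially linear data (assumed): Y = X'alpha + sin(2*pi*T) + e
rng = np.random.default_rng(3)
n = 400
t = rng.uniform(0.0, 1.0, n)
X = rng.standard_normal((n, 2))
alpha = np.array([1.0, -2.0])
g = np.sin(2 * np.pi * t)
y = X @ alpha + g + 0.1 * rng.standard_normal(n)

S = nw_smoother(t, h=0.05)
X_res = X - S @ X          # X minus its smooth fit on T
y_res = y - S @ y          # Y minus its smooth fit on T
alpha_hat = np.linalg.lstsq(X_res, y_res, rcond=None)[0]
g_hat = S @ (y - X @ alpha_hat)   # plug-in estimate of g at the Ti
```

Partialling out removes the confounding of the linear part by g(T), so alpha_hat is root-n consistent while g is recovered nonparametrically at the slower smoothing rate.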
Keywords/Search Tags: Variable selection, Dantzig selector, Irrepresentable Condition, model selection consistency, adaptive Dantzig selector, oracle property, high-dimensional settings, biased sub-model, partially linear model, consistent estimator, nonparametric adjustment