
Simulation Study And Empirical Analysis Of Several Variable Selection Methods

Posted on: 2015-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: S L Gao
Full Text: PDF
GTID: 2250330431453686
Subject: Probability theory and mathematical statistics
Abstract/Summary:
For high-dimensional data, the OLS method no longer works. To improve the interpretability and prediction accuracy of models, variable selection plays a significant role: statisticians consider how to efficiently select, from a large number of covariates, the important ones that contribute most to the response variable. Tibshirani proposed the influential Lasso method in 1996, opening the era of variable selection, and several variable selection methods for high-dimensional data arose in succession thereafter. Five methods are frequently used: Lasso, Adaptive Lasso, Elastic Net, SCAD, and SIS. The first four impose a penalty on the OLS objective to control the length of β. When these methods were introduced, their proposers established the theoretical rationale and conducted numerical simulations, and for some methods also compared them with related approaches.

This thesis aims to comprehensively compare the above five methods through numerical simulation and empirical analysis. In the numerical simulation section, six situations are considered, according to the relation between n (sample size) and p (dimensionality) and the magnitude of the correlation coefficients among the covariates, to compare the performance of the methods. In the empirical analysis section, we apply the methods for variable selection to data from a study on acute lymphoblastic leukemia and to a spam-screening dataset.

Analyzing the results of the simulation and empirical studies shows that all five methods select variables effectively. (1) As a milestone in the history of variable selection, Lasso shrinks the coefficients of variables toward 0 and sets some exactly to 0, owing to the geometric properties of its penalty term. (2) Adaptive Lasso, which is equivalent to weighting the penalty term, modifies the Lasso penalty and shrinks the model further.
The results show that the model selected by Adaptive Lasso is sparser than that selected by Lasso, improving interpretability; more importantly, Adaptive Lasso satisfies the oracle property, whereas Lasso does not. (3) Elastic Net is a combination of Lasso and ridge regression, with a parameter α controlling the weight between the two, so it inherits the merits of both methods. The thesis shows that it selects more variables than Lasso and, most importantly, that when a grouping effect occurs it exhibits a unique advantage, whereas the other four methods all fail. (4) SCAD can reduce dimensionality dramatically and usually selects fewer variables than the other methods; the SCAD estimator satisfies the three properties of unbiasedness, sparsity, and continuity. (5) SIS is a rough dimensionality-reduction technique for ultra-high-dimensional data that screens covariates by their correlation coefficients with the response variable. We first reduce dimensionality roughly with SIS and then apply one of the other four methods. For ultra-high-dimensional data, the results reveal that this two-stage procedure outperforms the other four methods applied without SIS.
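To make the comparison concrete, the workflow described above can be sketched in a few lines of Python. This is an illustrative example, not the thesis's own code: the simulated design, the penalty strengths, and the number of variables kept by the screening step are all assumptions chosen for demonstration. It fits Lasso and Elastic Net on data with p >> n, and then applies a simple SIS step (ranking covariates by marginal correlation with the response) followed by Lasso on the reduced design.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 500                          # p >> n: OLS is not applicable
beta = np.zeros(p)
beta[:5] = [3.0, 2.0, -2.5, 1.5, 4.0]    # only 5 truly relevant covariates
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Lasso: the L1 penalty shrinks coefficients and sets many exactly to 0.
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso nonzeros:", np.sum(lasso.coef_ != 0))

# Elastic Net: mixes L1 and L2 penalties; scikit-learn's l1_ratio plays
# the role of the mixing parameter the abstract calls alpha.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net nonzeros:", np.sum(enet.coef_ != 0))

# SIS: rank covariates by absolute marginal correlation with y, keep the
# top d (d < n is a common choice), then run Lasso on the reduced design.
d = n - 1
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
keep = np.argsort(corr)[-d:]
sis_lasso = Lasso(alpha=0.1).fit(X[:, keep], y)
print("SIS+Lasso nonzeros:", np.sum(sis_lasso.coef_ != 0))
```

In practice the penalty parameters would be chosen by cross-validation (e.g. `LassoCV`, `ElasticNetCV`) rather than fixed as here, and the screening size d in SIS is typically taken on the order of n/log(n).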
Keywords/Search Tags: Variable Selection, Lasso, Elastic Net, SCAD, SIS