
Variance Estimation Based On 3×2 Cross Validation In Ultrahigh Dimensional Linear Regression

Posted on: 2018-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: W N Yan
GTID: 2310330521451375
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Variance estimation is a basic problem of statistical inference in regression analysis. It plays a fundamental role in inference for regression coefficients, in hypothesis testing, and in choosing the regularization parameters used for variable selection. In traditional linear regression, a natural approach estimates the variance in two stages. In the first stage, a model selection tool such as AIC or BIC is applied to select a model. In the second stage, the variance is estimated by ordinary least squares using the variables selected in the first stage. This method is called least squares estimation (LSE). For the traditional linear regression model, the LSE is the uniformly minimum-variance unbiased (UMVU) estimator. In ultrahigh dimensional linear regression, however, the number of variables is greater than the sample size, so the LSE is not directly applicable.

Fan et al. (2012) showed that the least squares variance estimate is heavily biased and that the bias grows as the number of variables increases. They therefore proposed refitted cross validation (RCV), a variance estimator based on two-fold cross validation. RCV uses one part of the data to select the model (the variables) and a different part to estimate the coefficients and the variance. Extensive experiments showed that RCV effectively corrects the bias of the LSE. However, the performance of RCV depends entirely on the accuracy of variable selection: if the variables selected on one part of the data do not cover all true variables, the variance estimated on the other part is biased. To improve accuracy, Fan et al. (2012) suggested repeating the two-fold cross validation (repeated RCV). However, a single group of selected variables may fail to cover all true variables, and in that case repeated RCV does not lead to good results. In ultrahigh dimensional linear regression the useful variables are sparse, so the usual practice is to select variables with sure independence screening (SIS) first and then estimate the parameters with the selected variables. When SIS is used for variable selection inside RCV, it often loses some true variables. Repeated RCV does not improve the variable selection itself, and therefore it cannot improve the performance of RCV.

In this thesis, blocked 3×2 cross validation is proposed for estimating the variance in ultrahigh dimensional linear regression. Blocked 3×2 cross validation first divides the data into four parts, then takes any two parts as the training set and the remaining two as the test set, which yields three replications of two-fold cross validation. Wang et al. (2014) demonstrated the superiority of blocked 3×2 cross validation. The procedure is as follows. First, a variable selection tool is applied to each of the six combinations of the blocked 3×2 cross validation. Next, the variables are ranked by how often they are selected across the six votes, and the top-ranked variables make up the selected model. Finally, the variance is estimated with the selected model. We call this method vote-blocked 3×2 cross validation (V-B3×2CV).

Extensive experiments were carried out to compare the statistical performance of V-B3×2CV and RCV. The results showed that the variance estimate of V-B3×2CV has lower bias and smaller variance than that of RCV. Additional simulations indicated that V-B3×2CV is insensitive to the true model size. We applied the V-B3×2CV procedure to the white wine data from the UC Irvine Machine Learning Repository, which further supported the superiority of the method. Finally, theoretical analysis established the asymptotic normality of V-B3×2CV.
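The voting procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation. The function names (`sis_select`, `blocked_3x2_splits`, `ols_variance`, `vote_b3x2cv_variance`) are hypothetical, marginal-correlation screening stands in for "a variable selection tool", ties in the vote are broken by column index, and the final variance is refitted by ordinary least squares on the full sample. The abstract does not specify these last details, so they are assumptions.

```python
import itertools
import numpy as np

def sis_select(X, y, k):
    """Screening step (assumed): keep the k columns with the largest
    absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.argsort(score)[::-1][:k]

def blocked_3x2_splits(n, rng):
    """Divide the n indices into 4 blocks and return the 6 (train, test)
    pairs: every choice of 2 blocks as training set, the other 2 as test."""
    blocks = np.array_split(rng.permutation(n), 4)
    splits = []
    for a, b in itertools.combinations(range(4), 2):
        rest = [i for i in range(4) if i not in (a, b)]
        train = np.concatenate([blocks[a], blocks[b]])
        test = np.concatenate([blocks[i] for i in rest])
        splits.append((train, test))
    return splits

def ols_variance(X, y, model):
    """Residual variance of an OLS fit (with intercept) on the given columns."""
    Xm = np.column_stack([np.ones(len(y)), X[:, model]])
    beta, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    resid = y - Xm @ beta
    return resid @ resid / (len(y) - Xm.shape[1])

def vote_b3x2cv_variance(X, y, k, seed=0):
    """Select variables on each of the 6 training halves, rank variables by
    vote count, keep the top k, then estimate the variance (assumption: OLS
    refit on the full sample; the held-out halves are unused in this sketch)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1], dtype=int)
    for train, _ in blocked_3x2_splits(len(y), rng):
        votes[sis_select(X[train], y[train], k)] += 1
    model = np.argsort(-votes, kind="stable")[:k]
    return ols_variance(X, y, model), model
```

A variable that is screened in on all six training halves receives six votes, so true variables with a clear signal tend to rise to the top of the ranking even if a single split would occasionally miss one of them.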
Keywords/Search Tags: Ultrahigh dimensional regression, RCV, blocked 3×2 cross validation, variable selection, vote-blocked 3×2 cross validation, variance estimation