Variable Screening And Model Prediction In Ultra-high Dimensional Complex Data

Posted on: 2020-07-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Xie
GTID: 1487305762962209
Subject: Socio-economic statistics
Abstract/Summary:
The statistical analysis of ultra-high dimensional data has been an active topic in modern statistics in recent years. In ultra-high dimensional data, the dimension of the covariates can grow exponentially with the sample size, and it is usually assumed that only a few covariates contribute to the response. In this setting, variable screening is widely used to identify the important covariates, and various screening methods have been developed in recent years. However, these existing methods cannot be applied directly when the population is heterogeneous or when the data are large-scale, high dimensional and imbalanced. Because of the qualitative nature of the outcome variable and the high dimensionality of the predictors, screening important variables is rather challenging.

Meanwhile, model selection and model averaging are two popular approaches to improving prediction accuracy in regression analysis. In addition, missing data are frequently encountered in fields such as biomedical, social and psychological studies, for reasons such as the unwillingness of some sampled individuals to answer sensitive questions, loss of information caused by uncontrollable factors, and subjects who miss scheduled visits intermittently or drop out of the study. Ignoring missing data may lead to biased predictions and biased estimators.

To this end, within the framework of ultra-high dimensional data, this thesis develops two novel variable screening methods for large-scale imbalanced data and for heterogeneous categorical data, and proposes two ultra-high dimensional model averaging methods for prediction in quantile regression models and in linear regression models with responses missing at random. The main content of this thesis is as follows:

1. This thesis proposes a new robust variable screening procedure for case-control sampling with large-scale, high dimensional imbalanced data. To obtain a ranking index that is less sensitive to the case-control sampling design, this thesis considers a fused ranking utility obtained by repeating the case-control sampling several times. Under some regularity conditions, this thesis establishes the sure screening and ranking consistency properties of the proposed procedure. Simulation studies and an example analysis illustrate the effectiveness and feasibility of the proposed method.

2. This thesis develops a category-adaptive screening approach for analyzing ultra-high dimensional heterogeneous categorical data. By defining a dummy variable for each categorical level, the proposed procedure is able to provide a complete picture of the heterogeneous nature of the categorical response given the predictors. The proposal is model-free, in that it does not require the specification of a regression model, and adaptive, in that the set of active variables is allowed to vary across categories, which makes it more flexible in accommodating heterogeneity. Moreover, the proposed procedure can be applied directly to response-biased sampling data without any modification. Under some regularity conditions, this thesis establishes the sure screening and ranking consistency properties of the proposed procedure under either a prospective sampling design or a response-biased sampling design. Simulation studies and an example analysis illustrate the effectiveness and feasibility of the proposed method.

3. This thesis considers the prediction problem in linear quantile regression with ultra-high dimensional data. It proposes a computationally feasible sequential quantile model averaging (SQMA) method that combines a sequential screening process with a model averaging algorithm. The main idea is that only candidate models of size 1 are considered in each sequential step, and the weight of each candidate model is then determined by the Bayesian information score. The proposed method can therefore handle ultra-high dimensional data effectively at a greatly reduced computational cost, while providing more accurate and stable predictions. Under some regularity conditions, this thesis shows that the proposed SQMA method has good fitting capability and can mitigate overfitting. Simulation studies and an example analysis illustrate the effectiveness and feasibility of the proposed method.

4. This thesis considers the ultra-high dimensional prediction problem when responses are missing at random. A two-step model averaging procedure is proposed to improve the prediction accuracy of the conditional mean of the response variable. The first step specifies several candidate models: a new feature screening method based on the inverse probability weighted rank correlation (IPWRC) is developed to distinguish active from inactive predictors, and candidate models are formed by grouping covariates with IPWRC values of similar size. The second step develops a new criterion, the weighted delete-one cross-validation (WDCV), to find the optimal weights for averaging the candidate models. Under some regularity conditions, this thesis establishes the sure screening and ranking consistency properties of the proposed procedure. In addition, this thesis proves that the derived weights are asymptotically optimal in the sense that the corresponding weighted squared error is asymptotically identical to that of the infeasible best positive model averaging estimator. Simulation studies and an example analysis illustrate the effectiveness and feasibility of the proposed method.
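
The fused ranking idea in item 1 (repeating the case-control sampling and combining the resulting rankings) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's procedure: the abstract does not state the marginal utility or the fusion rule, so the absolute standardized mean difference, the averaging over repeats, the balanced subsampling of controls, and all function names and parameters below are assumptions made for the example.

```python
import numpy as np


def marginal_utility(X, y):
    """Per-covariate absolute standardized mean difference between cases and controls.

    Assumption: a stand-in utility chosen for illustration only.
    """
    cases, controls = X[y == 1], X[y == 0]
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    pooled_sd = np.sqrt(0.5 * (cases.var(axis=0) + controls.var(axis=0))) + 1e-12
    return np.abs(diff) / pooled_sd


def fused_screen(X, y, n_repeats=20, top_k=10, seed=0):
    """Average the marginal utility over repeated balanced case-control subsamples.

    Assumption: fusion by simple averaging; the thesis may fuse differently.
    Returns the indices of the top_k covariates and the fused utility vector.
    """
    rng = np.random.default_rng(seed)
    case_idx = np.flatnonzero(y == 1)
    control_idx = np.flatnonzero(y == 0)
    m = min(len(case_idx), len(control_idx))  # size of each balanced arm
    fused = np.zeros(X.shape[1])
    for _ in range(n_repeats):
        sub_cases = rng.choice(case_idx, size=m, replace=False)
        sub_controls = rng.choice(control_idx, size=m, replace=False)
        idx = np.concatenate([sub_cases, sub_controls])
        fused += marginal_utility(X[idx], y[idx])
    fused /= n_repeats
    keep = np.argsort(fused)[::-1][:top_k]  # largest fused utility first
    return keep, fused


def simulate_imbalanced(n=2000, p=200, n_cases=100, n_active=3, shift=1.5, seed=1):
    """Hypothetical toy data: 5% cases, with the first n_active covariates shifted in cases."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n, dtype=int)
    y[:n_cases] = 1
    X = rng.standard_normal((n, p))
    X[:n_cases, :n_active] += shift
    return X, y
```

On the toy data from `simulate_imbalanced()`, calling `fused_screen(X, y, top_k=3)` ranks covariates by their averaged utility; averaging over repeated subsamples is intended to make the ranking depend less on any single draw of controls.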
Keywords/Search Tags: Ultrahigh dimensional data, Complex data, Response-selective sampling, Variable screening, Model averaging