
Study On Ultrahigh Dimensional Feature Screening And Its Application

Posted on: 2019-09-04 | Degree: Master | Type: Thesis
Country: China | Candidate: L M Wang | Full Text: PDF
GTID: 2370330545470156 | Subject: Mathematics
Abstract/Summary:
With the rapid development and widespread application of data collection technology, ultrahigh-dimensional data can now be collected at relatively low cost in diverse scientific fields such as environmental science, medical science and finance. However, such data are difficult to handle in discriminant and regression analysis because of the "curse of dimensionality": ultrahigh dimensionality poses severe challenges for computational cost, estimation accuracy and model interpretability. Feature screening for ultrahigh-dimensional data has therefore become a hot topic in related research fields. In recent years, many screening methods have been proposed and studied, including methods based on linear models and model-free methods. In ultrahigh-dimensional feature screening, however, the response and the predictors can each be categorical or continuous, and in some situations the predictors include categorical and continuous variables simultaneously. We therefore propose a unified screening procedure, which exhibits competent empirical performance in our intensive simulations and real data analysis. Air quality forecasting is a research focus of environmental science; we also apply a feature screening method, the distance correlation (DC) coefficient, to atmospheric pollution forecasting, and find that using DC gives higher prediction precision. The main content of this thesis is as follows:

(1) For various kinds of data (continuous variables and categorical variables), we propose a unified screening index based on the difference between the conditional distribution function and the unconditional distribution function of a predictor. We use the kernel smoothing method to estimate the conditional distribution function. The unified screening index possesses the sure screening property and the ranking consistency property under some regularity assumptions. The proposed procedure (referred to as UFS) has several additional desirable characteristics. First, it is model-free: it requires neither the specification of a regression model nor parametric assumptions. Second, it is robust against heavy-tailed distributions and potential outliers. Third, UFS performs better when the relationship between predictor and response is non-linear. Monte Carlo simulations and real data examples are conducted to verify its finite-sample performance.

(2) Because meteorological data are time series data, air pollutant concentrations and meteorological elements over the prior and current days are used as a set of predictors to improve model interpretability. This also raises the dimension of the data, so we apply the feature screening method DC to atmospheric pollution forecasting. The important predictors are selected by DC from the set of predictors, and support vector regression (SVR) is then applied to predict atmospheric pollutant concentrations, yielding a DC-SVR model for statistical air quality forecasting. In forecasting daily averaged PM2.5 concentrations in Huaian and hourly PM2.5 concentrations in Hangzhou and Nanjing, the DC-SVR model achieves higher prediction precision than the SVR model. The DC-SVR model is therefore capable of reducing the dimensionality of the predictors and could potentially be used in operational air quality forecasting in the Yangtze River Delta.
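
The DC screening step described in (2) can be illustrated with a minimal sketch. It assumes the standard sample distance correlation (double-centered pairwise distance matrices) and ranks each predictor by its distance correlation with the response; the abstract does not state which estimator or cutoff the thesis uses. The names `distance_correlation`, `dc_screen` and the parameter `keep` are hypothetical, the data are simulated, and the SVR step that would follow the screening is not shown.

```python
import math
import random


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D sample."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so the column means equal the row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D samples."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def dc_screen(columns, response, keep):
    """Rank predictors by distance correlation with the response.

    `columns` is a list of predictor columns; `keep` (hypothetical name)
    is the number of top-ranked predictors to retain.
    Returns (indices of retained predictors, all scores).
    """
    scores = [distance_correlation(col, response) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)
    return ranked[:keep], scores


# Simulated illustration: 10 predictors, only predictor 3 drives the
# response, and it does so non-linearly.
random.seed(0)
n, p = 80, 10
columns = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
response = [columns[3][i] ** 2 + 0.1 * random.gauss(0.0, 1.0) for i in range(n)]
selected, scores = dc_screen(columns, response, keep=3)
print("retained predictor indices:", selected)
```

In a DC-SVR pipeline of the kind the abstract describes, only the retained columns would then be passed to the regression model. This pure-Python version costs O(n^2) per predictor, which is adequate for illustration but would need a vectorized implementation for genuinely ultrahigh-dimensional data.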
Keywords/Search Tags: Ultrahigh dimensional data, unified screening, air quality forecast, distance correlation coefficient, support vector regression