Screening Method Of Ultrahigh Dimensional Dataset Based On Rank Correlation Coefficient

Posted on: 2022-03-08  Degree: Master  Type: Thesis
Country: China  Candidate: Y Xiao  Full Text: PDF
GTID: 2530306326974549  Subject: Statistics
Abstract/Summary:
With the advent of the big data era, the analysis of ultrahigh dimensional datasets has become increasingly important. Ultrahigh dimension means that the number of candidate covariates p is allowed to grow exponentially with the number of observations n, while only a small number of predictors contribute to the response. This thesis proposes two new feature screening methods for ultrahigh dimensional datasets, with the aim of improving the accuracy and computational efficiency of their analysis. The proposed methods are based on the rank correlation coefficients R and τp*, which measure the dependence between the response and each covariate one at a time; the covariates that are strongly correlated with the response are then retained in the model. Because this procedure is similar to SIS and DC-SIS, the methods are named R-SIS and τp*-SIS, respectively.

The proposed methods have several advantages. First, unlike the existing SIS method, the screening procedure is nonparametric and requires no model specification. Second, the methods capture not only linear relationships between the response and the covariates but also nonlinear ones. In addition, R-SIS and τp*-SIS can be used with grouped predictor variables and multivariate response variables. Compared with DC-SIS, a leading method in the literature, the proposed methods apply more widely in real data analysis because they do not require second moment conditions on the distribution. Moreover, the rank correlation coefficients R and τp* are more robust than the distance correlation (DC), so R-SIS and τp*-SIS perform better than DC-SIS when the error term is heteroscedastic or when outliers are present.

To explore the finite-sample performance of R-SIS and τp*-SIS in screening important variables in ultrahigh dimensional datasets, we run simulations under a simple linear model, a heteroscedastic version of the simple linear model, a non-normal joint distribution of the sample, and nonlinear relationships between the covariates and the response, and compare the results with DC-SIS. In general, the proposed methods and DC-SIS perform about the same. However, when outliers or heteroscedasticity are present, the proposed methods can still accurately identify the important features in ultrahigh dimensional datasets; that is, they are more robust. Finally, a real dataset is used to illustrate the effectiveness of the proposed methods.
Keywords/Search Tags:Ultrahigh dimensionality, Feature screening, Rank correlation coefficient, Robust
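
The marginal screening procedure described in the abstract (score each covariate against the response, then keep the highest-scoring ones) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the coefficients R and τp*, so ordinary Kendall's tau is used here as a stand-in rank correlation, and the function names `kendall_tau` and `screen`, the cutoff `d`, and the toy data are all hypothetical. It is not the thesis's R-SIS or τp*-SIS.

```python
import random


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs.

    Stand-in rank correlation; NOT the thesis's R or tau_p* statistic.
    """
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                s += 1      # concordant pair
            elif prod < 0:
                s -= 1      # discordant pair (ties contribute 0)
    return s / (n * (n - 1) / 2)


def screen(columns, y, d):
    """Score each covariate marginally and return the top-d indices.

    columns: list of covariate columns, each a list of n values.
    d: number of covariates to keep (a user-chosen cutoff, assumed here).
    """
    scores = [abs(kendall_tau(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)
    return order[:d], scores


if __name__ == "__main__":
    # Toy data (assumed): p = 200 covariates, n = 60 observations,
    # response depends on covariate 0 linearly and covariate 1 nonlinearly.
    rng = random.Random(0)
    n, p = 60, 200
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [3 * columns[0][i] + columns[1][i] ** 3 + rng.gauss(0, 1)
         for i in range(n)]
    selected, _ = screen(columns, y, d=10)
    print("selected covariate indices:", selected)
```

Because the score depends only on ranks, a monotone transformation of a covariate leaves it unchanged, which is the intuition behind the robustness to outliers claimed in the abstract. Kendall's tau detects only monotone dependence, so this sketch does not reproduce the broader nonlinear detection attributed to R and τp*.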