Gini-Index Based Feature Screening For Ultrahigh Dimensional Categorical Data

Posted on: 2020-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: K W Chen
Full Text: PDF
GTID: 2370330623457306
Subject: Mathematics
Abstract/Summary:
With the rapid development of technology, collecting and storing ultra-high dimensional data is no longer a problem. The open question is how to analyze such data and draw useful conclusions from them. As is well known, the dimension of these data is very large and often grows exponentially with the sample size, so many traditional statistical methods no longer apply. For ultra-high dimensional data one generally adopts the sparsity principle, which assumes that only a small number of predictors have a significant impact on the response variable. Under this principle, many scholars have proposed feature screening methods for ultra-high dimensional variables. One reasonable and effective approach divides the dimensionality reduction into two stages. First, an efficient and convenient feature screening method selects the important predictors and quickly reduces the dimension to a manageable size, generally smaller than the sample size. Second, a mature variable selection method reduces the screened data further, so as to achieve good dimensionality reduction. This thesis focuses on the first stage, rapid feature screening, based on the distribution of the data. We propose Gini-index based feature screening methods (GB-SIS-2 and GB-SIS-M) for ultra-high dimensional categorical data. We then extend GB-SIS to ultra-high dimensional data with responses missing at random and propose a further feature screening method (GB-MAR).

Chapter 2 proposes a feature screening method (GB-SIS-2) for ultra-high dimensional binary data. It is based on the Gini index and uses the difference between the Gini coefficient of the response variable and the conditional Gini coefficient after adding a predictor. The chapter proves, using large-sample theory, that GB-SIS-2 satisfies the sure screening property. GB-SIS-2 is also model-free: there is no need to specify in advance how the response variable depends on the predictors. Compared with many feature screening methods that rely on model assumptions, GB-SIS-2 therefore avoids the problem of a misspecified model structure. The chapter also compares the screening performance of GB-SIS-2 with that of several other feature screening methods through several sets of Monte Carlo simulations under different parameter settings. The results show that the method outperforms the other feature screening methods, which also verifies its finite-sample properties. Finally, a real-data analysis on the classification of micro-blog bloggers illustrates its practicality and effectiveness.

Chapter 3 extends GB-SIS-2 to ultra-high dimensional multi-class data and constructs the multi-class Gini coefficient feature screening method (GB-SIS-M). This method is shown to have all the properties and advantages of the GB-SIS-2 method in Chapter 2. The chapter also confirms that GB-SIS-M has good finite-sample properties, using Monte Carlo simulations under several parameter settings and real gene-site data.

Chapter 4 starts from the observation that missing data account for a large proportion of ultra-high dimensional data, while existing research on feature screening for ultra-high dimensional missing data is scant. This chapter therefore combines the GB-SIS method proposed above with the traditional inverse probability weighting (IPW) method to handle the missing data, and proposes an ultra-high dimensional feature screening method based on the Gini index for responses missing at random (GB-MAR). For the analysis of missing data, inverse probability weighting retains more information than the complete case (CC) method, which makes the screening more accurate. GB-MAR is also a model-free screening method. The Monte Carlo simulations clearly show that GB-MAR screens significantly better than the GB-CC method and that GB-MAR is not affected by the proportion of missing data, which indicates that it is stable. Finally, in a mail classification example, GB-MAR performs almost as well as the GB-F method applied to the full data, which shows the practicality and effectiveness of GB-MAR.
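
The abstract does not give the exact GB-SIS or GB-MAR statistics, so the following is only a minimal sketch of the general idea it describes: score each categorical predictor by the Gini impurity of the response minus the conditional Gini impurity given that predictor, then keep the top-ranked predictors. The function names (`gini_impurity`, `gini_utility`, `screen_features`), the impurity formula 1 - sum_k p_k^2, and the optional weight argument are all assumptions made for illustration, not the thesis's definitions.

```python
import numpy as np


def gini_impurity(y, w):
    """Weighted Gini impurity 1 - sum_k p_k^2 (assumed form, not taken from the thesis)."""
    total = w.sum()
    if total <= 0:
        return 0.0  # an empty or zero-weight group contributes no impurity
    # p_k is the weighted share of class k
    p = np.array([w[y == k].sum() for k in np.unique(y)]) / total
    return 1.0 - np.sum(p ** 2)


def gini_utility(x, y, w=None):
    """Impurity of y minus the conditional impurity of y given the levels of x."""
    x = np.asarray(x)
    y = np.asarray(y)
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    total = w.sum()
    conditional = 0.0
    for level in np.unique(x):
        mask = x == level
        share = w[mask].sum() / total  # weighted estimate of P(x = level)
        conditional += share * gini_impurity(y[mask], w[mask])
    return gini_impurity(y, w) - conditional


def screen_features(X, y, d, w=None):
    """Return the indices of the d predictors with the largest utility, plus all utilities."""
    X = np.asarray(X)
    utilities = np.array([gini_utility(X[:, j], y, w) for j in range(X.shape[1])])
    top = np.argsort(-utilities, kind="stable")[:d]
    return top, utilities


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 200, 1000
    X = rng.integers(0, 3, size=(n, p))       # categorical predictors with 3 levels
    y = (X[:, 0] == 2).astype(int)            # response depends on predictor 0 only
    top, _ = screen_features(X, y, d=5)
    print(top)
```

For responses missing at random, one hypothetical way to mimic inverse probability weighting with this sketch is to pass `w[i] = delta[i] / pi_hat[i]`, where `delta[i]` indicates that the response is observed and `pi_hat[i]` is an estimated observation probability; unobserved responses then carry weight zero and any placeholder label for them is ignored. Whether GB-MAR is built this way is not stated in the abstract.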
Keywords/Search Tags:ultra-high dimensional data, missing data, feature screening, Gini-index, sure screening property