Grouped Feature Screening For Ultra-high Dimensional Data

Posted on: 2022-10-14
Degree: Master
Type: Thesis
Country: China
Candidate: H J He
GTID: 2480306521452424
Subject: Statistics
Abstract/Summary:
As a typical form of big data, ultra-high-dimensional data is widespread in medicine, biology, and social surveys. General statistical methods designed for low-dimensional and high-dimensional data are no longer applicable to ultra-high-dimensional data because of the complexity of the models and the characteristics of the data. Feature screening methods for ultra-high-dimensional data have therefore been proposed one after another to address these difficulties. In genomic and economic data, some covariates often occur in groups. Screening such grouped covariates as complete groups requires further development of group feature screening methods for ultra-high-dimensional data.

This thesis builds on two typical tools for feature selection in ultra-high-dimensional discrete data: information gain and Pearson's chi-square statistic. In view of the limitations of existing methods, new feature screening methods for ultra-high-dimensional data sets are proposed for both complete data and data missing at random. The proposed methods are proved theoretically to have feature screening consistency, and their performance in group feature screening and in classification models is verified through numerical simulation and empirical analysis.

First, for complete data, an improved grouped information gain feature screening method (GIGSIS) and a grouped Pearson chi-square statistic screening method (GASIS) are proposed. Joint information entropy and joint probability are used to represent the amount of information in a group of covariates and thus to indicate the importance of that group, which extends univariate feature screening to grouped variables. Numerical simulation shows that both grouped methods outperform univariate feature screening and that GASIS is more stable than GIGSIS. The empirical analysis shows that these methods can be applied to feature screening for classification models and that group feature screening performs very well.

Second, for data missing at random, a two-stage grouped feature screening method (GIMCSIS) based on an adjusted Pearson chi-square statistic is proposed. A new missing indicator variable is defined for each group of covariates: the group counts as observed only when all of its covariates are fully observed. In the second stage, the new missing indicator variables are used to define new screening statistics, which extends the two-stage screening method to grouped variables. Classification results from numerical simulation and colon cancer data further show that GIMCSIS selects covariates with good predictive ability and is more robust than univariate screening methods.

Finally, the current research progress is summarized and future work is outlined. The grouped feature screening methods show a certain degree of screening effectiveness and robustness, but their high computational complexity and the screening and estimation of missing data remain the focus of the next stage of research.
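The abstract does not give the exact GIGSIS and GASIS statistics, so the following is only a minimal sketch of the general idea for complete data: treat each group of discrete covariates as a single joint categorical variable, then score it against the class label with information gain or a Pearson chi-square-type statistic. All function names (`joint_codes`, `grouped_information_gain`, `grouped_chi_square`, `screen_groups`) and the toy data are hypothetical, and any normalization or adjustment used in the thesis is not reproduced here.

```python
import numpy as np

def joint_codes(Xg):
    """Map each row of a group of discrete covariates to one joint level."""
    Xg = np.asarray(Xg)
    if Xg.ndim == 1:                      # a single covariate is a group of size 1
        Xg = Xg[:, None]
    _, codes = np.unique(Xg, axis=0, return_inverse=True)
    return codes.ravel()

def joint_probabilities(y, codes):
    """Empirical joint probability table of (class label, joint group level)."""
    _, yc = np.unique(y, return_inverse=True)
    yc = yc.ravel()
    table = np.zeros((yc.max() + 1, codes.max() + 1))
    np.add.at(table, (yc, codes), 1.0)    # count co-occurrences
    return table / table.sum()

def entropy(p):
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p))

def grouped_information_gain(y, Xg):
    """Assumed form: IG = H(Y) + H(X_g) - H(Y, X_g), using joint entropy."""
    P = joint_probabilities(y, joint_codes(Xg))
    return entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - entropy(P.ravel())

def grouped_chi_square(y, Xg):
    """Assumed form: sum over cells of (p_rj - p_r * p_j)^2 / (p_r * p_j)."""
    P = joint_probabilities(y, joint_codes(Xg))
    E = np.outer(P.sum(axis=1), P.sum(axis=0))   # product of marginals
    return np.sum((P - E) ** 2 / E)

def screen_groups(y, X, groups, stat, d):
    """Score every group and keep the indices of the d highest-scoring groups."""
    scores = np.array([stat(y, X[:, cols]) for cols in groups])
    order = np.argsort(-scores)
    return order[:d], scores

# Toy data (hypothetical): a full 2^4 factorial design in which the label is
# the XOR of columns 0 and 1, so neither column is informative on its own.
X = np.array([[a, b, c, e] for a in (0, 1) for b in (0, 1)
              for c in (0, 1) for e in (0, 1)])
y = X[:, 0] ^ X[:, 1]
groups = [[0, 1], [2, 3]]
top, scores = screen_groups(y, X, groups, grouped_chi_square, d=1)
```

In this toy example the group of columns 0 and 1 receives the higher score under both statistics, while column 0 screened alone scores approximately zero. This is the kind of jointly informative group that a univariate screen would miss.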
Keywords/Search Tags: ultra-high-dimensional data, group feature screening, discrete group data, covariate missing at random, classification model