Grouped Feature Screening For Ultra-high Dimensional Data

Posted on:2022-10-14

Degree:Master

Type:Thesis

Country:China

Candidate:H J He

Full Text:PDF

GTID:2480306521452424

Subject:Statistics

Abstract/Summary:

As a typical form of big data, ultra-high-dimensional data is widespread in medicine, biology, and social surveys. General statistical methods designed for low-dimensional and high-dimensional data no longer apply in the ultra-high-dimensional case, because of the complexity of the models and the characteristics of the data. Ultra-high-dimensional feature screening methods have therefore been proposed one after another to address these difficulties. In genomic and economic data, some covariates often come in groups. Screening such covariates as whole groups requires further development of group feature screening methods for ultra-high-dimensional data. This thesis builds on two typical methods for feature screening of ultra-high-dimensional discrete data: information gain and Pearson's chi-square statistic. In view of the limitations of the existing methods, new feature screening methods for ultra-high-dimensional data are given for both fully observed data and data missing at random. The thesis proves theoretically that the proposed methods have the screening consistency property, and verifies their performance in group feature screening and in classification models through numerical simulation and empirical analysis.

Firstly, for fully observed data, an improved grouped information gain feature screening method (GIGSIS) and a grouped Pearson chi-square statistic screening method (GASIS) are proposed. Joint information entropy and joint probability are used to represent the amount of information carried by a group, which indicates the importance of that group among the covariates, and the univariate feature screening methods are thereby extended to grouped variables. Numerical simulation shows that both grouped screening methods outperform univariate feature screening, and that GASIS is more stable than GIGSIS. The empirical analysis shows that these methods can be applied to variable screening for classification models, and that the group feature screening results are excellent.

Secondly, for data missing at random, a two-stage grouped feature screening method (GIMCSIS) based on an adjusted Pearson's chi-square statistic is proposed. A new missing indicator variable is defined for each group of covariates: whether the group is fully observed is the criterion for treating it as missing or not. In the second stage, the new missing indicator variables are used to define new screening statistics, and the two-stage screening method is thereby extended to grouped variables. The classification results from numerical simulation and from colon cancer data further show that GIMCSIS can select covariates with good predictive ability and is more robust than univariate screening methods.

Finally, the current research progress is summarized and future directions are discussed. The grouped feature screening methods show a certain degree of screening effectiveness and robustness, but their high computational complexity, together with the screening and estimation of missing data, remains the research focus of the next stage.
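As a rough illustration of how joint entropy and joint probability can turn a univariate screening statistic into a grouped one, the following Python sketch scores each group by treating the joint levels of its covariates as a single categorical variable. It is only a sketch under stated assumptions: the function names, the toy data, and the exact form of both statistics are hypothetical choices for illustration, not the thesis's GIGSIS or GASIS definitions, which the abstract does not spell out.

```python
import itertools
from collections import Counter

import numpy as np


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def joint_levels(X_group):
    """Collapse the columns of one group into one joint label per row."""
    return [tuple(row) for row in np.asarray(X_group).tolist()]


def grouped_information_gain(X_group, y):
    """H(Y) minus H(Y | joint levels of the group). Assumed form, for illustration."""
    g = joint_levels(X_group)
    y = np.asarray(y).tolist()
    n = len(y)
    cond = 0.0
    for level, cnt in Counter(g).items():
        y_sub = [yi for gi, yi in zip(g, y) if gi == level]
        cond += cnt / n * entropy(y_sub)
    return entropy(y) - cond


def grouped_chi_square(X_group, y):
    """Pearson-type statistic on joint probabilities (assumed form):
    sum over (r, l) of (p_rl - p_r * p_l)^2 / (p_r * p_l)."""
    g = joint_levels(X_group)
    y = np.asarray(y).tolist()
    n = len(y)
    p_y = {r: c / n for r, c in Counter(y).items()}
    p_g = {l: c / n for l, c in Counter(g).items()}
    p_joint = {k: c / n for k, c in Counter(zip(y, g)).items()}
    stat = 0.0
    for r, pr in p_y.items():
        for l, pl in p_g.items():
            stat += (p_joint.get((r, l), 0.0) - pr * pl) ** 2 / (pr * pl)
    return stat


def screen_groups(X, y, groups, stat=grouped_chi_square, top_k=1):
    """Score every group and return the top_k group names plus all scores."""
    scores = {name: stat(X[:, cols], y) for name, cols in groups.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Toy data (hypothetical): four binary covariates in a full factorial design.
# The response depends on columns 0 and 1 only through their interaction.
X = np.array(list(itertools.product([0, 1], repeat=4)) * 5)
y = X[:, 0] ^ X[:, 1]
groups = {"A": [0, 1], "B": [2, 3]}

selected, scores = screen_groups(X, y, groups)
print(selected, scores)
```

In this toy design each covariate in group A is marginally independent of the response, so a univariate score on column 0 alone is zero, while the joint score for group A is not. This is one simple situation in which screening a group as a unit can recover a signal that one-at-a-time screening would miss.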

Keywords/Search Tags:

Ultra-high-dimensional data, group feature screening, discrete group data, covariate missing at random, classification model

Related items

1	Some Studies On Feature Screening Of Ultra-high-dimensional Longitudinal Data And Group Structured Data
2	Ultra-high Dimensional Feature Selection And Mean Estimation Under Random Missing Mechanism
3	Research On Feature Selection Of Ultra-high-dimensional Competitive Risk Data Based On Correlation Rank
4	Gini-Index Based Feature Screening For Ultrahigh Dimensional Categorical Data
5	Selection And Application Of High-Dimensional Complex Group Variables
6	Feature Screening Of Ultra-high Dimensional Classification Data With Exposure Variables
7	Variable Screening Methods For Ultra-high Dimensional Categorical Covariates
8	In The Case Of Ultra-high Dimensional Data, The Variable Filtering Of The Model Can Be Added
9	Research On Feature Selection Method Without Model Constraints Under Ultra High Dimensional Data
10	Ultra-high Dimensional Missing Data Analysis Based On Model Averaging