
Some Studies On Feature Screening Of Ultra-high-dimensional Longitudinal Data And Group Structured Data

Posted on: 2020-05-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Niu
GTID: 1360330620451989
Subject: Statistics
Abstract/Summary:
With the rapid development of science and technology and the continuing fall in data collection costs, ultra-high-dimensional data now appear in many scientific fields, such as genomics, biological imaging, tumor classification, economics, high-frequency trading and machine learning. The defining characteristic of such data is that the dimension p is much larger than the sample size n. Specifically, following the definition of Fan et al. (2009), we assume log p = O(n^α) for some α ∈ (0, 1/2), which is called ultra-high dimensionality. Under the sparsity assumption, the core problem is how to pick out, from the ultra-high-dimensional data, the few covariates that truly matter for the response. Traditional penalized variable selection methods face three difficulties in this setting: computational complexity, statistical accuracy, and algorithmic stability. Feature screening takes a different approach from variable selection: it uses a fast screening procedure to exclude covariates that are clearly unrelated to the response, so that the dimension is reduced to a moderate size. Traditional variable selection methods can then be applied successfully for model selection and estimation. Feature screening for ultra-high-dimensional data is therefore an important research direction. This dissertation carries out several screening studies on ultra-high-dimensional longitudinal data and grouped data. The main contents are as follows.

(1) Under an additive model, we propose a screening method for ultra-high-dimensional longitudinal data based on marginal nonparametric regression. Unlike earlier feature screening problems, the repeated measurements here are correlated within subjects. We fit each marginal nonparametric regression model using a B-spline basis approximation and select the important covariates by ranking a measure computed from these estimators. Under some mild conditions, the sure screening property is established for longitudinal data. On the algorithmic side, we propose an iterative algorithm, named INIS-SAM, that combines pre-screening with data-driven thresholds and subsequent variable selection, together with a greedy version. To further control the size of the final selected model, we apply data splitting to the screening method and obtain split-INIS-SAM. Simulation results show the good finite-sample performance of our method, and its advantages are also demonstrated by an analysis of the yeast cell cycle gene expression dataset.

(2) Under a linear model, we propose a marginal group variable screening method for ultra-high-dimensional data with grouped structure. Inspired by univariate screening, we fit the response linearly on each group of variables and measure the importance of each group by the goodness of fit. Theoretically, we show that under certain conditions the group screening method has the sure screening property. To improve its finite-sample performance, we propose a data-driven threshold selection method and an iterative group screening method based on it. We name the iterative method ISIS-Group-Lasso and also obtain a greedy version, called g-ISIS-Group-Lasso. Simulation results show that our group screening method performs better than other group variable screening methods. We also apply it to the study of a clone data set and achieve good results.

(3) To handle ultra-high-dimensional data with grouped structure, we further propose a marginal quantile group screening method that requires no model assumptions. It characterizes the relative importance of each group of variables by its marginal quantile fit, and it does not require any finite-moment assumptions. To give a more comprehensive picture of ultra-high-dimensional grouped data, we allow the set of important groups to vary with the quantile. Theoretically, this group screening method also has the sure screening property under relatively weak conditions. Compared with other group selection methods, our quantile-based adaptive screening method has better finite-sample performance. Finally, we illustrate the advantages of our approach through an analysis of genetic pathway data.

The screening methods in this dissertation enrich the feature screening literature for ultra-high-dimensional longitudinal data and ultra-high-dimensional grouped data. They help to select important variables or groups of variables in fields such as genetics, biomedical imaging and economics, with the aim of increasing computation speed, simplifying models and improving prediction accuracy.
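The marginal group screening idea in item (2) can be illustrated with a short sketch. It is not the dissertation's implementation: the choice of R^2 as the goodness-of-fit measure, the fixed number of groups to keep, the function names `group_marginal_r2` and `screen_groups`, and the simulated data are all assumptions made for illustration. The data-driven threshold and the iterative ISIS-Group-Lasso steps are not reproduced here.

```python
import numpy as np


def group_marginal_r2(X, y, groups):
    """Fit y on each group of columns by least squares and return its R^2.

    X: (n, p) design matrix; y: (n,) response;
    groups: (p,) array giving the group label of each column.
    R^2 is an assumed stand-in for the "goodness of fit" in the abstract.
    """
    y_centered = y - y.mean()
    total_ss = float(y_centered @ y_centered)
    scores = {}
    for label in np.unique(groups):
        Xg = X[:, groups == label]
        Xg = Xg - Xg.mean(axis=0)  # centering replaces an intercept term
        coef, *_ = np.linalg.lstsq(Xg, y_centered, rcond=None)
        resid = y_centered - Xg @ coef
        scores[label] = 1.0 - float(resid @ resid) / total_ss
    return scores


def screen_groups(X, y, groups, keep):
    """Return the labels of the `keep` groups with the largest marginal R^2."""
    scores = group_marginal_r2(X, y, groups)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Hypothetical simulated example: 200 groups of 3 covariates, n = 100,
# so p = 600 > n. Only groups 0 and 1 influence the response.
rng = np.random.default_rng(0)
n, n_groups, group_size = 100, 200, 3
groups = np.repeat(np.arange(n_groups), group_size)
X = rng.standard_normal((n, n_groups * group_size))
y = (
    X[:, groups == 0] @ np.array([2.0, -2.0, 2.0])
    + X[:, groups == 1] @ np.array([1.5, 1.5, -1.5])
    + 0.5 * rng.standard_normal(n)
)

selected = screen_groups(X, y, groups, keep=10)
print(selected)
```

Each group is scored on its own, so the cost grows linearly in the number of groups, which is what makes this kind of screening cheap compared with fitting a penalized model on all p covariates at once. A penalized method such as the group lasso could then be run on the retained groups only.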
Keywords/Search Tags:Ultra-high-dimensional, Feature screening, Additive model, Nonparametric independence screening, Sure screening property, Linear model, Variable selection, Group variable selection, Longitudinal data, Sparsity, Quantile