
Variable Screening And Clustering For High Dimensional Mixture Model

Posted on: 2023-01-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Dong
Full Text: PDF
GTID: 1528306620951769
Subject: Applied Statistics
Abstract/Summary:
Massive data sets are often heterogeneous because they combine data from different sources. The finite mixture model is an effective and powerful tool for analyzing heterogeneous data that may include two or more subgroups within a population. The recent literature on the finite mixture model focuses mainly on statistical inference with fixed dimensionality. In practice, ultrahigh-dimensional data are frequently encountered in many fields, such as genomics, biomedical imaging, and economics. In ultrahigh-dimensional data analysis, the dimension of the features (variables) may grow exponentially with the sample size, and a direct implementation of a mixture model often leads to poor prediction. To cope with the ultrahigh-dimensional mixture model, it is often reasonable to assume that only a handful of variables are relevant to the sub-population analysis or cluster analysis. Under this sparsity assumption, feature screening is widely used to remove most of the features that are irrelevant to the sub-populations before an in-depth analysis. Many marginal-effect-based methods have been proposed in this area in recent years. Despite this rich literature on feature screening, the existing methods are mainly based on a direct measure of correlation between the response and the features. They may not be directly applicable to the mixture model, where the response is linked to the features only implicitly, via the unobserved class labels. Thus, the study of the mixture model has great significance for both theoretical analysis and real applications.

This dissertation proposes an EM-based hybrid hard-soft (HHS-EM) variable screening procedure for ultrahigh-dimensional LCA. The consistency of parameter estimation, the sure screening property, and the consistency of the misclassification rate with respect to the optimal misclassification rate of LCA are studied. In addition, an EM-based alternating adaptive hard thresholding update (AHEM) method is proposed for fitting a high-dimensional Gaussian mixture model. Under some mild conditions, the consistency of parameter estimation and the consistency of the misclassification rate with respect to the optimal misclassification rate are established. Finally, for the nonparametric mixture model, a group variable screening method is considered and its sure screening property is studied. The performance of the proposed methods is illustrated by means of simulation studies and real-data examples. The main content is as follows:

1. This dissertation proposes an HHS-EM variable screening method for LCA in high-dimensional covariate scenarios. First, we develop a hybrid hard-soft penalty for the likelihood of LCA. The benefit of this hybrid penalty is that it combines the strengths of the L0 and L1 penalties to achieve more effective screening in an LCA. Next, for parameter estimation in LCA, we develop a gradient descent method based on the alternating direction method of multipliers. Then, we show that, under some mild conditions, HHS-EM enjoys the sure screening property and leads to a refined LCA that is effective and consistent for high-dimensional classification. Finally, the performance of HHS-EM is studied through a simulation study and an analysis of crime data.

2. This dissertation studies the sparse estimation of the mean vector and precision matrix of the high-dimensional Gaussian mixture model. First, we propose a new method built on an EM-based adaptive alternating hard thresholding update (AHEM) of the parameters of the Gaussian mixture model. The joint information carried in the Hessian matrix of the likelihood is naturally accounted for at each iteration as a basis for the next update, which further promotes the effectiveness of clustering. AHEM is easy to implement and fast to compute under a strict run-time analysis. Next, we show that, under some mild conditions, AHEM is consistent in both parameter estimation and misclassification error in high-dimensional cluster analysis. Finally, the promising performance of the method is supported by analyses of both simulated data and gene expression cancer data.

3. This dissertation considers a group variable screening method for the ultrahigh-dimensional nonparametric mixture model. The model increases flexibility by allowing a nonlinear transformation of each predictor to enter the regression model of each subgroup, where the unknown transformation functions are estimated nonparametrically. First, using B-splines, we propose a nonparametric group L0 and group L1 variable screening (NGS) method for the B-spline coefficients. Then, we show that the likelihood function is guaranteed to increase after each iteration of the NGS method and that NGS enjoys the sure screening property. Finally, the numerical performance of the proposed method is evaluated using finite-sample simulations and an analysis of ADNI-2 data.
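
The abstract does not give the algorithms themselves. As a rough illustration of the general idea of combining EM updates with hard thresholding, the following Python sketch fits a deliberately simplified model: a symmetric two-component mixture 0.5 N(mu, I) + 0.5 N(-mu, I) with a sparse mean mu. It is not the AHEM or HHS-EM procedure of the dissertation. The function names, the symmetric model with identity covariance, the spectral initialisation, and the fixed sparsity level `k` are all assumptions made for this example.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    out = np.zeros_like(v)
    if k <= 0:
        return out
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out


def sparse_em(X, k, n_iter=50):
    """Hard-thresholded EM for the toy model 0.5 N(mu, I) + 0.5 N(-mu, I).

    X is an (n, p) data matrix and k is the assumed number of relevant
    features. Returns an estimate of mu with at most k nonzero entries.
    """
    n, p = X.shape
    # Initialise from the leading eigenvector of the sample second-moment
    # matrix, whose population version is I + mu mu^T.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
    scale = np.sqrt(max(eigvals[-1] - 1.0, 1e-8))
    mu = hard_threshold(scale * eigvecs[:, -1], k)
    for _ in range(n_iter):
        # E-step: 2 * P(z = +1 | x) - 1 equals tanh(x^T mu) in this model.
        r = np.tanh(X @ mu)
        # M-step: weighted mean of the observations.
        mu = (r[:, None] * X).mean(axis=0)
        # Screening step: keep only the k largest coordinates.
        mu = hard_threshold(mu, k)
    return mu


# Small synthetic example: 3 relevant features out of 50.
rng = np.random.default_rng(1)
n, p = 400, 50
mu_true = np.zeros(p)
mu_true[:3] = 2.0
labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * mu_true + rng.standard_normal((n, p))

mu_hat = sparse_em(X, k=3)
selected = np.flatnonzero(mu_hat)         # indices of the retained features
clusters = np.sign(X @ mu_hat)            # cluster labels, up to a global sign
print("selected features:", selected)
```

In this toy setting the thresholding step plays the role of feature screening: features whose estimated mean difference is small are dropped at every iteration, so the clustering rule depends only on the retained coordinates. In practice `k` would have to be chosen by the analyst or tuned, which this sketch does not address.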
Keywords/Search Tags: Feature screening, Finite mixture model, Heterogeneity, Misclassification error, Ultrahigh dimensional data