
Statistical Computation And Theory Research Of Sufficient Dimension Reduction In Big Data

Posted on: 2022-12-19    Degree: Doctor    Type: Dissertation
Country: China    Candidate: M Cai    Full Text: PDF
GTID: 1487306773483714    Subject: Information and Post Economy
Abstract/Summary:
The curse of dimensionality, a term originally coined by Richard Bellman in 1957, refers to the various difficulties that arise across scientific fields when analyzing data in high-dimensional spaces. One efficient approach in the presence of the curse of dimensionality is to project high-dimensional data onto a low-dimensional space without loss of information. Sufficient dimension reduction is a prevailing methodology within this framework, in which linear combinations of the predictors are sought for data reduction and visualization. Characterized by its model-free and sufficiency properties, sufficient dimension reduction has attracted considerable interest over the past thirty years. Meanwhile, the real world produces not merely high-dimensional data, but high-dimensional data with huge sample sizes, i.e., so-called big data. Although sufficient dimension reduction contributes substantially to dimension reduction itself, modern computing techniques are needed to handle the sheer magnitude of such datasets with limited computing resources. This thesis is dedicated to the statistical computation and theory of sufficient dimension reduction in big data, incorporating modern computing techniques and statistical methods to extend sufficient dimension reduction.

In the first part, we implement a family of sufficient dimension reduction estimators with the orthogonalizing EM algorithm. This family recovers the dimension-reduction space by constructing a series of transformation-function-based responses. By regressing the newly transformed responses on the predictors, one obtains a least-squares solution as an estimator of the central subspace. When the data are gigantic, this estimating procedure is rather resource-intensive. Fortunately, the orthogonalizing EM algorithm, which is essentially an iterative EM algorithm, enjoys great advantages for large-scale ordinary least squares: it relieves the computational burden and applies equally well to penalized least squares. We realize this algorithm for sufficient dimension reduction and establish the convergence of the proposed estimating sequence. Moreover, we introduce the least absolute shrinkage and selection operator (LASSO) penalty to achieve dimension reduction and variable selection simultaneously. Simulation studies and real data analysis show that this application of the orthogonalizing EM algorithm is highly efficient, reducing computing time while maintaining accurate estimation of the central subspace.

Although the orthogonalizing EM algorithm meets the computational needs of some least-squares-based sufficient dimension reduction estimators, many other methods face the same challenge. Divide and conquer analyzes sub-datasets on several machines and combines the results at the end, which eases the need for high-performance computers. WIRE, a new slicing-free method capable of discovering the central subspace with multivariate responses, is taken as the example for incorporating the divide-and-conquer technique. In the second part of this thesis, the relevant asymptotic properties are established and the efficiency of divide-and-conquer-based sufficient dimension reduction is proved. Simulation studies and a real-data example illustrate the method and display the superior performance of the proposed estimator.

However, as observed in the second part, there remains a slight gap between the estimation accuracy of the divide-and-conquer estimator and that of the whole-sample estimator. The gap arises because some information is inevitably lost: each machine has access to only a limited proportion of the data, and there is no interaction between machines. In the third part, we introduce empirical likelihood to compensate for this loss by utilizing auxiliary information in the form of general estimating equations, and we further elaborate how to incorporate empirical likelihood into sufficient dimension reduction, which is proved to attain higher accuracy. The asymptotic properties of the empirical-likelihood-based sufficient dimension reduction estimator under divide and conquer are studied. Since the true auxiliary information of the population is difficult to obtain, we substitute an estimate computed from the full sample, and we also explore whether such estimated auxiliary information can still improve the estimation of the central subspace. Simulation studies and real data analysis show that our method improves upon the estimator based on divide and conquer alone.
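The computational core of the first part is an iterative scheme for large-scale (penalized) least squares. The sketch below illustrates the flavor of such an iteration under simplifying assumptions; it is not the thesis's exact orthogonalizing EM algorithm. The update β ← β + Xᵀ(y − Xβ)/d, with d no smaller than the largest eigenvalue of XᵀX, converges to the OLS solution while touching the data only through matrix-vector products, and inserting a soft-thresholding step gives a LASSO-penalized variant.

```python
import numpy as np

def iterative_ls(X, y, lam=0.0, n_iter=500):
    """Iterative least squares in the spirit of the orthogonalizing EM
    algorithm (a hedged sketch, not the thesis's exact procedure).
    Each pass costs one X'(y - X beta) product, avoiding a direct
    solve of the normal equations.  lam > 0 adds a LASSO penalty via
    soft-thresholding."""
    n, p = X.shape
    # d must dominate the largest eigenvalue of X'X for convergence;
    # the spectral norm of X squared is exactly that eigenvalue
    d = np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = beta + X.T @ (y - X @ beta) / d
        if lam > 0:  # soft-threshold for the penalized version
            beta = np.sign(beta) * np.maximum(np.abs(beta) - lam / d, 0.0)
    return beta
```

With lam = 0 the iteration recovers the ordinary least-squares solution; the per-iteration cost is O(np), which is what makes this attractive when n is huge.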
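The divide-and-conquer pattern of the second part can be shown schematically: each block computes a candidate kernel matrix, the block kernels are averaged, and the leading eigenvectors of the average estimate a basis of the central subspace. WIRE's kernel is not reproduced here; as a stand-in this sketch uses the simple response-based pHd kernel E[(y − Ey)(X − EX)(X − EX)ᵀ], another slicing-free candidate matrix. The split-and-average pattern, not the particular kernel, is the point.

```python
import numpy as np

def phd_kernel(X, y):
    """Response-based pHd kernel E[(y - Ey)(X - EX)(X - EX)'] on one
    block -- a simple slicing-free stand-in for the WIRE kernel."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).T @ Xc / len(y)

def dc_directions(X, y, n_blocks, d):
    """Divide and conquer: split the sample, form the kernel on each
    block, average the block kernels, then take the leading d
    eigenvectors as the estimated basis of the central subspace."""
    Ms = [phd_kernel(Xb, yb)
          for Xb, yb in zip(np.array_split(X, n_blocks),
                            np.array_split(y, n_blocks))]
    M = np.mean(Ms, axis=0)              # combine the block results
    # the kernel is symmetric but indefinite, so rank directions by
    # absolute eigenvalue
    w, V = np.linalg.eigh(M)
    order = np.argsort(-np.abs(w))
    return V[:, order[:d]]
```

In a real distributed setting each block kernel would be computed on a separate machine and only the p-by-p matrices shipped back, which is exactly what keeps the communication cost independent of the block sample sizes.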
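The empirical-likelihood correction of the third part reweights observations so that an auxiliary estimating equation holds exactly. The following is a textbook sketch for a single scalar estimating function, not the thesis's full multi-equation, distributed procedure: the weights wᵢ maximize Σ log wᵢ subject to Σ wᵢ = 1 and Σ wᵢ g(xᵢ) = 0, which gives wᵢ = 1/(n(1 + λgᵢ)) with the Lagrange multiplier λ found by Newton's method.

```python
import numpy as np

def el_weights(g, n_iter=50):
    """Empirical likelihood weights for one scalar estimating function
    g_i (a textbook sketch).  Maximizes sum(log w_i) subject to
    sum(w_i) = 1 and sum(w_i * g_i) = 0; the profile solution is
    w_i = 1 / (n * (1 + lam * g_i)) with lam solving
    sum(g_i / (1 + lam * g_i)) = 0, found here by Newton's method."""
    n = len(g)
    lam = 0.0
    for _ in range(n_iter):
        denom = 1.0 + lam * g
        grad = np.sum(g / denom)            # the equation to zero out
        hess = -np.sum((g / denom) ** 2)    # its derivative in lam
        lam -= grad / hess
    return 1.0 / (n * (1.0 + lam * g))
```

For example, if the auxiliary information is a full-sample estimate μ̂ of a population mean, setting g(xᵢ) = xᵢ − μ̂ on a block yields weights under which the reweighted block mean matches μ̂ exactly, which is how the auxiliary information pulls a block estimator toward the full-sample answer.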
Keywords/Search Tags:Sufficient dimension reduction, Big data, Orthogonalizing EM algorithm, Divide and conquer, Empirical likelihood, Auxiliary information