
Quantile Regression Of High-dimensional Varying Coefficient Model Under Memory Constraint

Posted on: 2021-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Liang
Full Text: PDF
GTID: 2370330611497972
Subject: Probability theory and mathematical statistics
Abstract/Summary:
In recent years, high-dimensional data analysis has become a central topic of statistical research. From the data perspective, its defining characteristics are massive sample sizes and large numbers of features: the frequency and volume of what people observe have increased, and the angles and channels through which data are collected have expanded. These characteristics appear in all kinds of fields in the economy, society, and science. As the role of data receives more and more attention, development in these fields becomes data-driven, which in turn pushes computer applications and statistical methodology to adapt to the requirements of the times. In the big-data context, hardware infrastructure, efficient computational methods, and statistical analysis methods are complementary. Hardware facilities, such as computer memory, processing capacity, channels for obtaining data, and media for storing data, constrain what the latter two can achieve; the computational method determines the time consumption and stability of the analysis; and the statistical methodology ensures the accuracy of the results. In practice, different statistical tools are also subject to the characteristics of the sample, and on high-dimensional massive data, traditionally excellent methods often fail to be effective. These limitations pose challenges for data scientists: large-scale analysis tasks require high-quality platforms, fast computational methods, and appropriate statistical methodology, and yet data of this kind hinders the processing of statistical models by computers.

In general, computers cannot handle large-scale data directly because of memory limitations. This is a serious obstacle for the high-frequency, high-dimensional data that is now acquired conveniently and stored cheaply in many fields, because the information contained in the complete dataset cannot be utilized in a timely manner. In biomedical data, for example, the number of genes reaches tens of thousands; financial transaction data arrives at very high frequency; and constantly expanding data types such as text, images, audio, and video, drawn from all aspects of social life, place great demands on storage and processing capacity. The data itself thus brings challenges to data analysis, and the problem of constructing models under realistic constraints arises. In particular, with limited personal-computer memory, how to adjust a model so that it remains competent for a given analysis task under such constraints becomes a challenge.

An important research direction arising from high-dimensional data is variable selection, because the data are often redundant: features collected from multiple perspectives are correlated, many noise dimensions drown out the signal, and a large share of the samples are not homogeneous. The noise produces missing data, outliers, and heterogeneity, which bring many obstacles and much instability to the analysis. What we hope is that useful and interesting features can be extracted from the collected data to build the analysis, and that the interpretable results can then guide future work. This preparation often occupies a large part of the entire analysis process; once it is done, the analysis falls into place. Occam's razor states that among all candidate models we should choose the one that explains the existing data well in the simplest form.
The principle of sparsity holds that although signals are collected in many feature dimensions, only a few of them are truly useful. Based on these principles, we would like to select features, which reduces the difficulty of the solution and the space it occupies, and naturally yields a model that is easy to interpret.

From the perspective of regression methods, quantile regression has characteristics that least squares regression cannot replace. The key point is that quantile regression can handle heterogeneous data: it drops the requirements on the error distribution and possesses a certain robustness. In applications, many error terms exhibit heavy tails; least squares regression is easily disturbed by such complex errors, and its performance is then unsatisfactory. As a basic extension of the ordinary linear model, the varying coefficient model has become a powerful tool for high-dimensional data analysis. Among nonparametric models its structure is simple, and the coefficient functions change continuously with the state variable, so it is also interpretable. Combining quantile regression with the varying coefficient model can take advantage of both. Applying big data to the quantile varying coefficient regression model also often requires the variable selection mentioned above: on the one hand, the computational complexity of the solution is reduced; on the other hand, concise and usable results are obtained. One approach is to add an appropriate penalty function to the optimization problem of the regression; many models use the LASSO method, with its L1 penalty, to obtain a refined model (the standard form of this penalized objective is written out at the end of this passage). Although LASSO itself lacks the oracle property, researchers have proposed extensions of LASSO whose parameter estimates are consistent and enjoy the oracle property.

Returning to the problem of high-dimensional massive data raised at the beginning, various numerical optimization algorithms have been studied in computer science and computational mathematics to address it. Researchers proposed the idea of distributed computing: divide a problem that requires very large computing capacity into many small parts, allocate these parts to multiple machines for processing, and finally combine the partial results into a final answer. Drawing on these numerical optimization algorithms, this thesis introduces the coordinate descent method and the ADMM method within the framework of quantile regression; both can handle the optimization problems brought about by data segmentation, and the latter is easier to integrate into a distributed framework. Statisticians naturally have their own methods for dealing with the limited memory of machines. Random sampling is one approach that does not use the complete data. This thesis focuses instead on block estimation, which shares the same idea as the divide-and-conquer algorithm. Research on block estimation mainly concerns how to aggregate the results computed on subsets of the data so that the aggregate comes close to the result that would be obtained from the complete, unsplit data. Median selection, majority voting, and significance tests all provide solutions for aggregating the subset results. In brief, when researchers face big data under machine-memory constraints, they all deploy splitting methods, and their integration schemes are similar.
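Before turning to the proposed model, it helps to write out the building block that the penalty-based approach rests on: the L1-penalized quantile regression objective, which combines the check loss with a LASSO penalty. The display below is the standard textbook form for a linear model, not a formula quoted from the thesis; in the varying coefficient case the linear predictor is replaced by the B-spline expansion described in the next part.

$$
\hat{\beta}(\tau) \;=\; \arg\min_{\beta \in \mathbb{R}^{p}} \; \frac{1}{n}\sum_{i=1}^{n} \rho_{\tau}\!\left(y_{i} - x_{i}^{\top}\beta\right) \;+\; \lambda \sum_{j=1}^{p} |\beta_{j}|,
\qquad
\rho_{\tau}(u) \;=\; u\,\{\tau - I(u<0)\},
$$

where $\rho_{\tau}$ is the check (pinball) loss at quantile level $\tau$ and $\lambda$ controls the strength of the penalty; the adaptive LASSO used in the second screening step replaces $|\beta_{j}|$ by $w_{j}|\beta_{j}|$ with data-driven weights $w_{j}$.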
Building on this body of research in distributed computing, this thesis focuses mainly on the construction of a distributed varying coefficient quantile regression model under memory constraints. The model supports both the distributed deployment of the computation and the selection of features. It is obtained by combining the two-step LASSO selection procedure for high-dimensional varying coefficient quantile regression with the median-selection idea of the block-based MESSAGE algorithm for least squares regression. The former uses LASSO and adaptive LASSO for a two-step screening in quantile varying coefficient regression, after the coefficient functions have been approximated by B-spline basis functions and thereby reduced to sets of spline coefficients. The first step controls the scale of the model: the true model is contained at this scale, so variables in high-dimensional data can be selected effectively and the dimension reduced greatly. Taking the first step as the initial estimate, the second step sets the weights and carries out the adaptive LASSO to obtain the final model; this selection procedure is consistent. The latter splits the dataset across sub-machines for regression and finally integrates the regression results by taking the median as the decision index for variable selection, a method inspired by the idea of divide and conquer. Our distributed model combines the advantages of the two: it deploys the two-step variable selection procedure of varying coefficient quantile regression on the sub-machines, and its integration step embodies the idea of trusting the results on which most sub-machines agree (a simplified sketch of this split-and-aggregate procedure is given after this summary). The asymptotic analysis shows that the final aggregated estimator is consistent.

In the numerical simulation study, we explored the performance of the proposed distributed model on different data by setting and varying the functional coefficients, the covariate distribution, the error distribution, the quantile levels, the sample size, and the dimension of the full model. The experiments showed good performance in variable selection, estimation accuracy, and computational cost. We found that the model selects the most useful variables. Compared with the non-distributed model, the estimation accuracy of the functional coefficients is slightly reduced, but the computation time is greatly reduced; moreover, the method is not easily disturbed by spurious variables, and its results are summarized in a simplified model. Finally, the distributed varying coefficient quantile regression model is applied to the real dataset of the 2005-2006 India Demographic and Health Survey to explore the factors that affect malnutrition among children aged 0 to 5. The results show that the mother's height, the mother's years of education, and the months of breastfeeding all have an impact on the child's height, and this influence changes as the child grows. At the same time, the numerical results show that the model still has limitations: on very heavy-tailed data its performance is often not as good as on ordinary data, which places further demands on robustness, and there is still room for improvement in the aggregation of results, in distinguishing nonzero constant coefficients from varying coefficients, and in computational efficiency.
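The split-fit-aggregate procedure described above can be summarized in the following minimal Python sketch. It is an illustration under simplifying assumptions, not the thesis's implementation: the helper names (spline_design, fit_block, distributed_select), the use of scikit-learn's SplineTransformer and QuantileRegressor, the plain L1 penalty in place of the two-step LASSO/adaptive LASSO, and all tuning constants are choices made here for brevity.

```python
# Simplified sketch of distributed varying coefficient quantile regression:
# approximate beta_j(U) by B-splines, fit an L1-penalised quantile regression
# (check loss + L1 penalty) on each data block, then aggregate the blockwise
# results by majority voting and the coordinatewise median.
# Assumption: scikit-learn >= 1.0; names and constants are illustrative only.
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.preprocessing import SplineTransformer


def spline_design(X, basis):
    """Design matrix with columns X_j * B_k(U): covariates crossed with the spline basis."""
    return np.einsum("ij,ik->ijk", X, basis).reshape(X.shape[0], -1)


def fit_block(X, basis, y, tau, alpha):
    """Penalised quantile fit on one block; returns a p x K array of spline coefficients."""
    Z = spline_design(X, basis)
    model = QuantileRegressor(quantile=tau, alpha=alpha, solver="highs").fit(Z, y)
    return model.coef_.reshape(X.shape[1], basis.shape[1])


def distributed_select(X, U, y, n_blocks=10, tau=0.5, alpha=0.1, seed=0):
    """Split the sample, fit every block, keep covariates selected on most blocks,
    and aggregate the kept spline coefficients by their coordinatewise median."""
    # One common B-spline basis (knots fixed from the full range of U) so that
    # the blockwise coefficient estimates are comparable when aggregated.
    spl = SplineTransformer(n_knots=6, degree=3).fit(U.reshape(-1, 1))
    basis_all = spl.transform(U.reshape(-1, 1))
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(y)), n_blocks)
    coefs = np.stack([fit_block(X[b], basis_all[b], y[b], tau, alpha) for b in blocks])
    selected = np.abs(coefs).sum(axis=2) > 1e-8   # which covariates each block keeps
    keep = selected.mean(axis=0) > 0.5            # majority vote across blocks
    theta_hat = np.median(coefs, axis=0)          # median aggregation of the estimates
    theta_hat[~keep] = 0.0
    return keep, theta_hat
```

In an actual deployment each block would be fitted on its own sub-machine and only the small p x K coefficient arrays would be returned for aggregation; everything runs in a single process here purely for illustration.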
Keywords/Search Tags: massive data, varying coefficient quantile regression, sample splitting, B-spline, median selection