
Inference For Large-scale And High-dimensional Data With A Split And Conquer Approach

Posted on: 2021-04-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J R Zhang
GTID: 1487306314455224
Subject: Statistics
Abstract/Summary:
As the scale and dimension of data continue to grow, making accurate inference at low computational cost has become an important research direction. Inference by de-biasing penalized estimators is one significant approach, but the de-biasing procedure increases the computational cost by an order of magnitude close to the dimensionality compared with the initial penalized estimation. We therefore suggest a split and conquer approach to reduce the computational cost and improve computing speed. Moreover, we guarantee that the accuracy of the confidence intervals obtained with the partitioned approach is asymptotically the same as that obtained using all the data at once. This dissertation thus focuses on inference for high-dimensional data using a split and conquer approach.

First, we consider the random design case for high-dimensional data and derive the asymptotic properties and confidence intervals; on this basis we establish the confidence intervals of the suggested partitioned approach. To improve computing speed while keeping the accuracy of inference, we separate the initial estimation and projection steps, using the whole data to obtain the initial estimator and the partitioned data to obtain the de-biasing part. This reveals that the sample sizes needed for these two steps to have statistical guarantees are different. Building on this, we propose a refined version of the confidence intervals that guarantees their accuracy as the split size gets larger.

In the numerical simulation study, we analyze two data sets of different magnitudes, one large and one relatively small. We compare the average coverage probability, the average length of the confidence intervals, and the computing speed for different split sizes. The results demonstrate the validity of the proposed method, and the refined version performs extremely well. Moreover, running times improve significantly as the split size gets larger. In the real data analysis, we also test the accuracy and computing speed of our confidence intervals under the split and conquer method. The analysis verifies the validity of our method and shows the robustness of the splitting procedure on high-dimensional data. In addition, we propose a soft-thresholding variable selection method based on the mean bagging estimator and derive its error bounds, which leaves room to extend the method to further applications.

Finally, we establish a bootstrap-assisted procedure for simultaneous inference based on the split and conquer method. We use different methods to construct simultaneous confidence intervals for sets with finitely many elements and for sets with a large number of elements under the split and conquer approach. The former are constructed from the two versions of confidence intervals mentioned above; for sets with a large number of elements, we use the bootstrap-assisted procedure. We also provide different algorithms for computing the bootstrap-assisted procedure. Simulation and real data analyses comparing the performance of the simultaneous confidence intervals across split sizes show that our method is reasonable and effective in both computing speed and accuracy of inference.
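
The soft-thresholding selection step on the mean bagging estimator can be illustrated with a minimal sketch. It assumes that each of the K sub-samples has already produced a coefficient vector, that the mean bagging estimator is their coordinate-wise average, and that selection keeps the coordinates that survive shrinkage by a threshold. The function names, the toy estimates, and the threshold value `lam` are hypothetical; the abstract does not state the exact estimator or how the threshold is chosen.

```python
from typing import List, Sequence


def mean_aggregate(estimates: Sequence[Sequence[float]]) -> List[float]:
    """Average K sub-sample coefficient vectors coordinate by coordinate."""
    k = len(estimates)
    p = len(estimates[0])
    return [sum(est[j] for est in estimates) / k for j in range(p)]


def soft_threshold(vector: Sequence[float], lam: float) -> List[float]:
    """Shrink each coordinate toward zero by lam; set it to zero if |v| <= lam."""
    out = []
    for v in vector:
        magnitude = abs(v) - lam
        if magnitude <= 0.0:
            out.append(0.0)
        else:
            out.append(magnitude if v > 0 else -magnitude)
    return out


# Toy coefficient estimates from K = 3 hypothetical sub-samples (p = 3).
estimates = [
    [1.0, 0.1, -0.5],
    [1.2, -0.1, -0.7],
    [0.8, 0.0, -0.6],
]

averaged = mean_aggregate(estimates)          # mean bagging estimator
selected = soft_threshold(averaged, lam=0.2)  # lam is an assumed threshold

# Indices of the variables kept after thresholding.
support = [j for j, v in enumerate(selected) if v != 0.0]
print(support)
```

Averaging before thresholding means a coordinate that is small or changes sign across sub-samples is pulled toward zero and then removed, while coordinates that are consistently large survive with a shrunken value.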
Keywords/Search Tags: Big data, confidence intervals, statistical inference, de-biased estimator, large split sizes, simultaneous confidence intervals, variable selection