
Research On New Subsampling Statistical Learning Methods And Applications In Big Data Analysis

Posted on: 2022-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: D Liao
Full Text: PDF
GTID: 2517306494472984
Subject: Statistics
Abstract/Summary:
How to obtain valuable information from large-scale, high-dimensional data sets under limited computing power has become an important research topic in big data analysis. For such data sets, subsampling strategies and variable selection methods are popular tools for reducing data volume and improving computational efficiency. To address the computational bottleneck caused by the dramatic growth of data volume during modeling and analysis, this thesis applies a two-stage subsampling strategy and variable selection techniques to large-sample classification problems.

Existing subsampling algorithms for large-sample logistic regression mainly define subsampling probabilities by minimizing the asymptotic mean squared error of the estimator, the gradient of the loss function, or the Hessian information matrix. In robust statistical analysis, samples with high leverage values are usually regarded as potential outliers; on the other hand, the fact that high-leverage data points can improve a model's overall predictive performance has been proven in experimental design and fully verified for large-sample ordinary least squares approximation. Building on this result, we first normalize the leverage scores of the logistic regression model, define subsampling probabilities that depend on these leverage scores, and propose a two-stage leveraging subsampling algorithm for computing the maximum likelihood estimate of the model. Next, for large-sample logistic regression with a sparse representation, we combine the subsampling algorithm with a variable selection technique and propose a subsampling-variable selection algorithm for this case. Finally, we generalize the importance subsampling strategy to the large-sample Support Vector Machine (SVM). Despite its good theoretical
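The two-stage leveraging subsampling idea described above can be illustrated with a minimal sketch: a uniform pilot subsample yields an initial coefficient estimate, the leverage scores of the weight-adjusted design matrix are normalized into subsampling probabilities, and a weighted maximum likelihood fit is computed on the second-stage subsample. This is an illustrative reconstruction, not the thesis's exact algorithm; the function names, pilot/subsample sizes, and the Newton-Raphson solver are assumptions for the sketch.

```python
import numpy as np

def fit_logistic(X, y, w=None, iters=25):
    """(Weighted) logistic regression MLE via Newton-Raphson."""
    n, d = X.shape
    if w is None:
        w = np.ones(n)
    beta = np.zeros(d)
    for _ in range(iters):
        z = np.clip(X @ beta, -30, 30)          # guard against overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (w * (y - p))              # weighted score vector
        W = w * p * (1 - p)                     # weighted Fisher information weights
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(d)
        beta += np.linalg.solve(H, grad)
    return beta

def leverage_subsample_logistic(X, y, r0=300, r=800, seed=0):
    """Two-stage leveraging subsampling for logistic regression (sketch).

    Stage 1: uniform pilot subsample -> pilot estimate beta0.
    Stage 2: sample with normalized leverage scores, fit weighted MLE.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Stage 1: pilot estimate from a uniform subsample of size r0
    idx0 = rng.choice(n, size=r0, replace=False)
    beta0 = fit_logistic(X[idx0], y[idx0])
    # Leverage scores of the weighted design matrix sqrt(W) X:
    # h_i = squared row norms of the thin-QR factor Q of sqrt(W) X
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta0, -30, 30)))
    Xw = X * np.sqrt(p * (1 - p))[:, None]
    Q, _ = np.linalg.qr(Xw)
    h = np.sum(Q ** 2, axis=1)
    probs = h / h.sum()                         # normalized leverage scores
    # Stage 2: importance subsample; reweight by 1 / (r * pi_i) for unbiasedness
    idx = rng.choice(n, size=r, replace=True, p=probs)
    return fit_logistic(X[idx], y[idx], w=1.0 / (r * probs[idx]))
```

On synthetic data the second-stage estimate recovers the sign and rough scale of the true coefficients while only ever fitting the model on r0 + r rows instead of all n.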
foundations and generalization performance, SVM is not well suited to classifying large data sets, because finding a separating hyperplane requires solving a quadratic programming problem of intensive computational complexity. In view of the sparsity of SVM solutions and the geometric relation between the data points and the decision hyperplane, we define the subsampling probabilities according to the distances between the data points and the decision hyperplane, and propose an importance subsampling algorithm that recovers the more significant data points into the training set, making SVM more practical for large-scale classification tasks.

Simulation experiments on different synthetic data sets show that: (1) the leveraging subsampling algorithm outperforms the several subsampling algorithms proposed in existing research when the class distributions are imbalanced; (2) compared with uniform sampling, combining the importance subsampling algorithms with the variable selection technique improves both the classification accuracy and the interpretability of the model; (3) the importance subsampling algorithm based on the distances between the sample points and the decision hyperplane achieves better classification accuracy than uniform sampling.
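The distance-based importance subsampling for SVM can be sketched along the same two-stage lines: a pilot SVM fitted on a uniform subsample gives a provisional hyperplane, points closer to that hyperplane (the likely support vectors) receive higher subsampling probabilities, and the final SVM is trained on the importance subsample. This is a minimal illustration under assumptions not taken from the thesis: a linear SVM without intercept, a Pegasos-style subgradient solver standing in for the quadratic programming solver, and inverse-distance probabilities as one concrete choice of "closer means more important".

```python
import numpy as np

def pegasos_svm(X, y, lam=0.01, iters=2000, seed=0):
    """Linear SVM via the Pegasos stochastic subgradient method (labels in {-1, +1})."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)                   # decaying step size
        w *= (1.0 - eta * lam)                  # shrink from the regularizer
        if y[i] * (X[i] @ w) < 1:               # hinge-loss subgradient step
            w += eta * y[i] * X[i]
    return w

def distance_subsample_svm(X, y, r0=300, r=600, seed=0):
    """Importance subsampling for SVM (sketch): points near the pilot
    hyperplane get higher sampling probability."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Stage 1: pilot hyperplane from a uniform subsample
    idx0 = rng.choice(n, size=r0, replace=False)
    w0 = pegasos_svm(X[idx0], y[idx0])
    # Distance of every point to the pilot decision hyperplane
    dist = np.abs(X @ w0) / (np.linalg.norm(w0) + 1e-12)
    probs = 1.0 / (dist + 1e-3)                 # closer -> more important
    probs /= probs.sum()
    # Stage 2: train the final SVM on the importance subsample
    idx = rng.choice(n, size=r, replace=True, p=probs)
    return pegasos_svm(X[idx], y[idx])
```

Because the subsample concentrates near the decision boundary, the final fit sees mostly the points that would be support vectors of the full problem, which is where the claimed efficiency gain over uniform sampling comes from.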
Keywords/Search Tags:Importance Subsampling, Leverage, Variable Selection, Logistic Regression Model, Support Vector Machine