Research On The Performance Of Algorithms For Solving Generalized Linear Problems Under Large-Scale Data

Posted on: 2021-05-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y Chai
Full Text: PDF
GTID: 2427330602983572
Subject: Applied statistics
Abstract/Summary:
Generalized linear problems are a subclass of stochastic optimization problems that is of great importance in statistics and machine learning research. They are formulated as expected risk minimization problems and can represent the parameter optimization process of many regression and classification models, which makes them highly worthy of study. Since the joint distribution of a model's input and output variables is unknown in practice, the expected risk minimization problem must be transformed into an empirical risk minimization problem. However, the classical iterative algorithms for solving the empirical risk minimization problem are computationally intensive and inefficient under large-scale data (i.e., the number of observations is much larger than the dimension, and the dimension is much larger than 1), resulting in high time costs.

Against this background, Chapter 3 introduces a new algorithm for generalized linear problems: the SLS algorithm. Under large-scale data, its computational cost is substantially lower than that of previous algorithms, with no significant loss of accuracy. We give the theoretical basis of the algorithm and describe its steps in detail. The algorithm has two steps; the second step, as in previous algorithms, must be completed through iteration. For the iteration step, we compare the time complexity of the SLS algorithm with that of previous algorithms and show that the SLS algorithm is significantly better.

The efficiency of the SLS algorithm under large-scale data is commendable, but there is still room for improvement. In its first step, the computational cost of least squares estimation is very large under large-scale data, which slows down the whole algorithm. To address this problem, we use the fast least squares method based on subsampling, introduced in Chapter 4, to reduce this cost. The method selects a small portion of the data according to a given proportion (subsampling for short) and uses this subsample to estimate the covariance, thereby reducing the computational cost of least squares. As concrete instances of the fast least squares method, we introduce three fast least squares estimators: (1) the full subsampling estimator; (2) the covariance subsampling estimator; (3) the Uluru estimator. They serve as substitutes for the original least squares estimator. Using time complexity and error bounds as criteria, we compare the fast least squares estimators with the ordinary least squares estimator in a linear regression model and theoretically confirm the improvement in computational efficiency achieved by the fast least squares method.
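As a rough illustration of how these substitutes work, the following numpy sketch gives one plausible implementation of the three estimators. The abstract states only the general idea, so the concrete formulas here, in particular the Gram-matrix rescaling in covs_ols and the Uluru correction factor, are our assumptions modelled on the subsampling least squares literature (e.g., Dhillon et al., 2013), not the thesis's own derivations.

```python
import numpy as np

def fs_ols(X, y, r, rng):
    """Full subsampling (FS): ordinary least squares computed on a
    random subsample of size r * n only."""
    n = X.shape[0]
    idx = rng.choice(n, size=int(r * n), replace=False)
    Xs, ys = X[idx], y[idx]
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

def covs_ols(X, y, r, rng):
    """Covariance subsampling (CovS): estimate the Gram matrix X'X
    from a subsample, but keep the full-data X'y term."""
    n = X.shape[0]
    idx = rng.choice(n, size=int(r * n), replace=False)
    Xs = X[idx]
    # Rescale the subsampled Gram matrix to the full-sample scale.
    return np.linalg.solve((n / len(idx)) * (Xs.T @ Xs), X.T @ y)

def uluru_ols(X, y, r, rng):
    """Uluru: the FS estimate plus a one-step bias correction built
    from the residuals on the remaining data. The scaling factor
    ns / (n - ns) is our assumption, after Dhillon et al. (2013)."""
    n = X.shape[0]
    ns = int(r * n)
    idx = rng.permutation(n)
    s, rest = idx[:ns], idx[ns:]
    Xs, ys = X[s], y[s]
    G = Xs.T @ Xs                      # subsample Gram matrix
    beta_fs = np.linalg.solve(G, Xs.T @ ys)
    resid = y[rest] - X[rest] @ beta_fs
    correction = np.linalg.solve(G, X[rest].T @ resid) * (ns / (n - ns))
    return beta_fs + correction
```

All three avoid forming the full Gram matrix X'X, which is the dominant cost of ordinary least squares when the number of observations is large.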
To clarify the optimization effect of the fast least squares method on the SLS algorithm, in Chapter 5 we use precision and running time as criteria to compare the three optimized SLS algorithms (the FS-SLS, CovS-SLS, and Uluru-SLS methods) with the original SLS algorithm. In the simulation study, we compare each algorithm's performance in a linear regression model, a binary logistic regression model, and a Poisson regression model, where the input data are sampled from a multivariate normal distribution or a Bernoulli distribution, and the subsampling ratio is selected from the set {0.004, 0.008, 0.011}.

The study shows the following. Under the linear regression model, the FS-SLS and Uluru-SLS methods significantly reduce the running time of the SLS method without markedly increasing the error: the running time of FS-SLS is 68.2% of that of SLS, and the running time of Uluru-SLS is 76.1% of that of SLS. Under the binary logistic regression model, when the components of the input variable are uncorrelated, the errors of the FS-SLS and Uluru-SLS methods gradually approach the error of SLS as the amount and the dimension of the data increase, while their running times decrease significantly relative to SLS; in particular, when the input data follow a Bernoulli distribution, the running time of FS-SLS is only 47.7% of that of SLS, and the running time of Uluru-SLS is 49.9%. Under the Poisson regression model with certain correlations between the components of the input variable, the CovS-SLS method achieves the smallest error of the four methods, and its running time is 68.6% of that of SLS. In the empirical part, we use the Covertype data set from the UCI Machine Learning Repository to build a Poisson regression model that predicts the type of forest cover vegetation, with a subsampling ratio of 0.15. The study shows that the prediction mean squared error of the Uluru-SLS method is not significantly lower than that of the SLS method, but its running time is 84.4% of that of SLS.

Our conclusions are as follows. When the sampling ratio is chosen from the set {0.004, 0.008, 0.011}, the FS-SLS and Uluru-SLS methods improve the efficiency of the SLS algorithm in the linear regression model and in the binary logistic regression model without correlations between the components of the input variable, and the CovS-SLS method improves the efficiency of the SLS algorithm in the Poisson regression model with such correlations. When the sampling ratio is 0.15, the Uluru-SLS method improves the efficiency of the SLS algorithm in the Poisson regression model.
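For readers who want to reproduce the flavor of the Chapter 5 comparison, here is a toy timing run on simulated linear regression data, reusing the estimator sketches above. The data sizes, seed, subsampling ratio, and error metric are illustrative assumptions, not the settings or results reported in the thesis.

```python
import time
import numpy as np

# Toy run in the spirit of the Chapter 5 simulations: a linear
# regression model with multivariate normal inputs and a subsampling
# ratio drawn from {0.004, 0.008, 0.011}. Sizes are illustrative.
rng = np.random.default_rng(0)
n, d, ratio = 200_000, 50, 0.008
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + rng.standard_normal(n)

for name, est in [("OLS", lambda: np.linalg.solve(X.T @ X, X.T @ y)),
                  ("FS", lambda: fs_ols(X, y, ratio, rng)),
                  ("CovS", lambda: covs_ols(X, y, ratio, rng)),
                  ("Uluru", lambda: uluru_ols(X, y, ratio, rng))]:
    t0 = time.perf_counter()
    beta_hat = est()
    dt = time.perf_counter() - t0
    err = np.linalg.norm(beta_hat - beta_true)
    print(f"{name:6s} time={dt:.3f}s  l2-error={err:.4f}")
```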
Keywords/Search Tags: Generalized linear problems, SLS algorithm, fast least squares method, subsampling