
Optimal Subsampling Of Regression Model Based On Information Matrix With Big Data

Posted on: 2024-07-13
Degree: Master
Type: Thesis
Country: China
Candidate: W Guo
GTID: 2530307085467844
Subject: Statistics
Abstract/Summary:
With the rapid development of science and technology, data volumes are growing geometrically, and big data has become an important object of study in modern data analysis. The availability of massive data has reshaped traditional data analysis and theoretical research, and the unprecedented volume of data has made many traditional analysis methods inapplicable. Against this background, this thesis studies subsampling algorithms for the quantile regression model and the logistic regression model with massive data.

The first part introduces the research background and significance of the thesis, and summarizes the current state of research on subsampling methods, the IBOSS method, and the divide-and-conquer IBOSS strategy in big data analysis.

The second part studies subsampling for the quantile regression model with big data. First, the thesis proves that, for any estimator based on random subsampling, the covariance matrix of the subsample parameter estimate is bounded below, in the Loewner order, by a matrix inversely proportional to the subsample size. Second, it proves that the determinant of the information matrix of any subsample-based estimate has an upper bound. Building on this result, an IBOSS algorithm based on the D-optimality criterion is proposed for the quantile regression model. Extensive simulation experiments then compare the performance of the IBOSS algorithm, two-step sampling based on the L-optimality criterion, and uniform sampling in different scenarios. Finally, the three methods are applied to airline flight data with a sample size of 7,009,728, and the comparison shows that the IBOSS algorithm still performs well.

The third part studies subsampling for the logistic regression model with big data. First, the IBOSS algorithm for logistic regression is introduced and combined with the divide-and-conquer idea to give the DC-IBOSS algorithm, which can analyze very large data sets in both distributed-computing and personal-computer settings. Second, the thesis proves that when the subsample size is fixed, the full sample size tends to infinity, and the covariates are normally distributed, at least one eigenvalue of the information matrix of the subsample selected by DC-IBOSS tends to infinity. Extensive numerical simulations then show that the number of partitions has minimal impact on the estimation performance of DC-IBOSS, and that DC-IBOSS is significantly better than divide-and-conquer uniform sampling. Finally, DC-IBOSS is evaluated on a data set of 11 million observations of Higgs boson production processes; it achieves higher precision and adapts to distributed scenarios in big data analysis.

The fourth part extends the combination of IBOSS and divide and conquer to quantile regression. First, a DC-IBOSS algorithm for quantile regression is proposed; then the performance of DC-IBOSS and divide-and-conquer uniform sampling is compared in different scenarios through numerical simulation.
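As a rough illustration of the selection idea summarized above, the following Python sketch implements the IBOSS rule as it is commonly described for linear models (for each covariate in turn, keep the rows with the smallest and largest values among those not yet chosen), together with a simple divide-and-conquer wrapper. This is an assumption-laden sketch, not the thesis's algorithm: the function names `iboss_indices` and `dc_iboss_indices`, the equal split of the budget across blocks, and the use of the determinant of the design's moment matrix as a D-optimality proxy are all illustrative choices, and the quantile- and logistic-regression variants proposed in the thesis may differ in their details.

```python
import numpy as np


def iboss_indices(X, k):
    """Pick about k rows of X using an IBOSS-style rule (illustrative sketch).

    For each covariate in turn, take the r rows with the smallest values and
    the r rows with the largest values among rows not already selected,
    where r = k // (2 * p).
    """
    n, p = X.shape
    r = k // (2 * p)  # rows taken from each tail of each covariate
    if r < 1 or 2 * r * p > n:
        raise ValueError("k must satisfy 2 * p <= k <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        # Sort the still-available rows by the j-th covariate.
        order = idx[np.argsort(X[idx, j], kind="stable")]
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False  # never select the same row twice
    return np.concatenate(chosen)


def dc_iboss_indices(X, k, n_blocks):
    """Divide-and-conquer wrapper (assumed form): split the rows into blocks,
    run the IBOSS-style rule on each block, and pool the selected rows."""
    blocks = np.array_split(np.arange(X.shape[0]), n_blocks)
    k_block = k // n_blocks  # assumed: equal budget per block
    return np.concatenate([b[iboss_indices(X[b], k_block)] for b in blocks])


def log_det_information(X):
    """Log-determinant of Z'Z with an intercept column, a D-optimality proxy."""
    Z = np.column_stack([np.ones(X.shape[0]), X])
    return np.linalg.slogdet(Z.T @ Z)[1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100_000, 5))  # simulated normal covariates
    k = 1_000
    uniform = rng.choice(X.shape[0], size=k, replace=False)
    print("uniform :", log_det_information(X[uniform]))
    print("IBOSS   :", log_det_information(X[iboss_indices(X, k)]))
    print("DC-IBOSS:", log_det_information(X[dc_iboss_indices(X, k, 10)]))
```

Because each block only needs its own rows in memory, the wrapper can be run block by block on one machine or in parallel across workers, which is the practical appeal of the divide-and-conquer variant.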
Keywords/Search Tags: Big data, Quantile regression, Logistic regression, D-optimality criterion, Information matrix, Divide-and-conquer