In recent years, air pollution, led by smog, has become a major issue affecting people's livelihood, especially in northern cities such as Beijing and Harbin. Haze directly affects urban traffic by reducing visibility, and Particulate Matter smaller than 2.5 μm (PM2.5) is particularly harmful to the human body. Predicting the concentration of PM2.5 not only lets the public know in advance how outdoor particulate matter may affect their health, but also gives the relevant authorities the initiative to issue early warnings and to control air pollution.

Domestic and foreign scholars have done extensive research on PM2.5-related air quality prediction and have achieved good results. Representative regression methods used in these prediction approaches include Gradient Boosting Decision Tree (GBDT), Support Vector Machine (SVM), and Logistic Regression (LR). To improve the generalization ability of the model, most of these methods let the model learn as many types of samples as possible, ignoring the correlation between the training set data and the problem to be predicted. This leaves a large proportion of noise samples in the training set and, to a certain extent, increases the deviation of the prediction result. The intuitive symptom is that in long-range prediction tasks beyond 24 hours, the prediction error grows rapidly as the temporal gradient gradually vanishes; especially in northern regions, where pollution is frequent and conditions are variable, long-range predictions are often of little reference value.

At present, the missing-value completion methods used in PM2.5 prediction mainly include cubic spline interpolation, linear interpolation, and mean interpolation. Although these are classical methods, they are susceptible to error propagation when long stretches of data are continuously missing, resulting in an unsatisfactory interpolation
effect.

To solve the above problems, we take the sample similarity between the training set data and the problem to be predicted as the basis for constructing the training set. We propose a Sample Approximation (SA) algorithm from the perspective of imbalanced learning and approximation methods, and implement an SA-based PM2.5 prediction algorithm in combination with existing regression algorithms. In the data preprocessing stage, aiming at the problem of long continuous gaps in the original data, we propose a Repeated Marker Interpolation (RMI) method based on multi-layer perceptron (MLP) regression interpolation.

The specific work of this article is as follows. First, the research status of PM2.5 prediction algorithms is surveyed and analyzed. Many prediction methods have been proposed for PM2.5, some of which have been applied successfully in certain areas. However, these methods mostly seek a universal model at the expense of local accuracy, and lack consideration of the correlation between the training set samples and the specific problem to be predicted. Moreover, the classical interpolation methods used in PM2.5 prediction studies are susceptible to error propagation when long stretches of data are missing, and their interpolation results are not ideal.

Second, the RMI interpolation method is proposed. We repeatedly perform missing-value marking followed by missing-value filling on the original data until no marked data remain. During each filling pass, according to the specific pattern of data loss around each hour, a certain amount of data in the vicinity of the missing value is selected to construct a feature space; an MLP model is then trained on it and used to predict the missing values.

Third, the sample approximation algorithm SA is proposed. In order to distribute the training set samples around the problem to be predicted and make them converge toward it, we first define a simple sample
similarity calculation method based on Euclidean distance and time distance, and then sample the training set according to its similarity distribution using the principle of acceptance-rejection sampling. In the sampling process, the weighted sum of a uniform sampling probability and a Gaussian sampling probability corresponding to each sample's similarity is used as the probability that the sample is drawn, so that samples with higher similarity are more likely to be drawn, while samples in denser regions have a lower individual probability of being drawn. By repeatedly performing this approximation sampling, and gradually lowering the probability that low-similarity samples are drawn, the training set distribution gradually approaches the problem to be predicted; the process stops when successive iterations yield a stable prediction result or the maximum number of iterations is reached, and that result is output as the final prediction.

Fourth, experiments compare the error of repeated marker interpolation against common interpolation methods on continuously missing data, and compare the long-range prediction error of GBDT and SVM combined with the SA algorithm against plain GBDT, SVM, MLP, and other prediction methods. We first simulated continuous data loss on real air quality data, and then compared the root mean square error of RMI, linear interpolation, mean interpolation, and cubic spline interpolation for gaps of various lengths. The experimental results show that when the continuous gap is long, the error of repeated marker interpolation is lower than that of the other three classical interpolation methods. Then, the prediction errors of GBDT and SVM combined with the SA algorithm and of plain GBDT, SVM, and other conventional methods are compared on Beijing's air quality data set from September 2016 to February 2018. The experimental
results show that the root mean square error (RMSE) and mean absolute percentage error (MAPE) of the SA method are lower than those of GBDT, SVM, LR, and other algorithms in the 24-hour, 48-hour, 72-hour, and 96-hour prediction tasks.
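The RMI procedure described above (repeated marking of missing values, then window-based MLP filling, until no marks remain) can be sketched as follows. The paper does not specify the window size, network architecture, or pass limit, so the values below (`window=3`, a single 16-unit hidden layer, ten passes) are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def rmi_fill(series, window=3, max_passes=10):
    """Repeated Marker Interpolation (illustrative sketch).

    Repeatedly (1) mark missing entries, then (2) fill those whose
    preceding window is fully observed, using an MLP trained on windows
    of observed data, until no marks remain or passes run out.
    """
    x = np.asarray(series, dtype=float).copy()
    for _ in range(max_passes):
        if not np.isnan(x).any():
            break  # no marked data left
        X_train, y_train, X_pred, pred_idx = [], [], [], []
        for i in range(window, len(x)):
            feats = x[i - window:i]
            if np.isnan(feats).any():
                continue  # window itself is incomplete; try a later pass
            if np.isnan(x[i]):
                X_pred.append(feats); pred_idx.append(i)
            else:
                X_train.append(feats); y_train.append(x[i])
        if not X_pred or not X_train:
            break
        model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                             random_state=0)
        model.fit(np.array(X_train), np.array(y_train))
        x[pred_idx] = model.predict(np.array(X_pred))  # fill this pass's marks
    return x
```

Because each pass only fills values whose window is complete, a long continuous gap is filled one position per pass from its left edge, which is how the repeated-marking loop limits error propagation compared with a single interpolation pass.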
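Similarly, the acceptance-rejection step of the SA algorithm might look like the following sketch. The exact similarity formula, Gaussian width, and weighting/decay schedule are not given in the text, so the `1/(1+d)` similarity, `sigma`, `lam`, and the halving of `lam` per round are hypothetical choices used only to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(X, t, x_q, t_q, alpha=0.5):
    """Sample similarity combining Euclidean feature distance and time
    distance; the 1/(1+d) form and weight alpha are assumptions."""
    d_feat = np.linalg.norm(X - x_q, axis=1)
    d_time = np.abs(t - t_q)
    return alpha / (1.0 + d_feat) + (1 - alpha) / (1.0 + d_time)

def sa_sample(X, t, x_q, t_q, n_rounds=3, lam=0.5, sigma=0.2):
    """SA-style resampling via acceptance-rejection.

    Acceptance probability is a weighted sum of a uniform component and a
    Gaussian component centred on the highest similarity, so high-similarity
    samples are kept more often; lam shrinks each round, progressively
    rejecting low-similarity samples.
    """
    idx = np.arange(len(X))
    for _ in range(n_rounds):
        s = similarity(X[idx], t[idx], x_q, t_q)
        gauss = np.exp(-((s - s.max()) ** 2) / (2 * sigma ** 2))
        p = lam + (1.0 - lam) * gauss          # uniform + Gaussian mixture
        keep = rng.random(len(idx)) < p        # accept/reject each sample
        if keep.sum() < 10:                    # keep a minimum working set
            break
        idx = idx[keep]
        lam *= 0.5                             # stricter in later rounds
    return idx
```

The returned indices would then form the training set for the downstream regressor (GBDT, SVM, etc.), with the loop repeated until predictions stabilise or an iteration cap is hit, as described above.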