| With the rapid development of China’s economy, haze occurs frequently, and PM2.5 is its main pollutant, so environmental protection is urgently needed. China has its own environmental monitoring system, but a large amount of the collected data is not fully used. Using historical data to predict PM2.5 concentration is therefore valuable: it can help people avoid pollution in time and give the government enough time to respond. The main research work of this paper is as follows. Chapter 1: Research background, research status and research process. Chapter 2: The theory used in this paper, including statistical learning, linear regression, Naive Bayes and model evaluation indices. Chapter 3: Data acquisition and data preprocessing. The data set was shared by UCI and spans Jan 2, 2010 to Dec 31, 2014; it includes time, temperature, pressure, wind speed and other variables. Preprocessing covers cleaning the data, checking data consistency and handling missing values, so that the data are better suited to the models. Chapter 4: Model building. Two models are built: a multiple linear regression model for different seasons, and a Naive Bayes model for predicting severely polluted weather. (1) The traditional multiple linear regression model is improved. The higher the model score, the better the model performs. The traditional multiple linear regression model scores 58.732, the model after heat-map-based optimization scores 65.987, and the model after iterative feature selection scores 69.657. Finally, separate models are discussed for different seasons; the winter model scores 93.985. (2) Naive Bayes classification is used to study severely polluted weather. Repeated experiments show that the model parameters are best after the time factor is removed. The recall of the optimized model in predicting abnormal weather is 0.79, which means that nearly 80% of abnormal weather can be correctly identified, so the model is applicable. Chapter 5: Summary and outlook. Both final models are useful; however, the imbalance of the original data set affects the precision of the model in predicting severely polluted weather. The proportion of non-severe pollution data is so large that the model is biased toward that class. To address this problem, the paper also discusses model optimization methods for imbalanced data sets and gives a feasible direction for future research. |
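
The relationship between recall, precision and class imbalance mentioned in the summary can be illustrated with a minimal sketch. The labels below are hypothetical and are not taken from the paper's data; the function names are likewise assumptions made for illustration. The sketch only shows why a classifier can reach a recall near 0.8 on the rare "severe" class while its precision stays low when non-severe samples dominate.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, fn


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical imbalanced sample: 10 severe (1) and 90 non-severe (0) records.
y_true = [1] * 10 + [0] * 90

# Hypothetical predictions: 8 of the 10 severe records are caught,
# and 12 of the 90 non-severe records are flagged as severe by mistake.
y_pred = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 78

tp, fp, fn = confusion_counts(y_true, y_pred)
severe_recall = recall(tp, fn)
severe_precision = precision(tp, fp)

print(f"recall={severe_recall:.2f}, precision={severe_precision:.2f}")
```

In this toy case most severe records are found, yet more than half of the "severe" alarms are false, because even a small error rate on the large non-severe class produces many false positives.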