Methodological Study On List Screening Of Pollution Sources Survey Based On Big Data

Posted on: 2020-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Lu
Full Text: PDF
GTID: 2381330590495229
Subject: Environmental Science and Engineering
Abstract/Summary:
To strengthen environmental supervision and management, and to understand the basic environmental information of enterprises and institutions, China conducted its first national survey of pollution sources in 2008. In retrospect, that survey was limited by its historical background and by the data analysis capabilities of the time, and it had many deficiencies. In the list screening stage of the pollution source survey, the government department selects basic units according to each enterprise's industry classification code, and the resulting list serves as the basic inventory for physical inspection. However, the data provided by the government department is incomplete, and the industry classification codes used for screening contain a large number of errors. As a result, the inventory of basic units includes many enterprises from non-target industries. At the same time, for many practical reasons, a large amount of target enterprise information is missing from the government department data, which makes the basic list of industrial pollution sources inaccurate. The second national survey of pollution sources has recently begun. This study therefore uses big data and related technologies to identify and correct industry categories from enterprises' business data, and uses Internet big data to supplement the basic unit list. The aim is to optimize the data processing flow of the list screening stage and to improve both the efficiency and the accuracy with which the basic unit list is constructed.

First, the study evaluates the data provided by the government department and builds a machine learning classification model. Following the general machine learning approach to practical problems, it constructs standard data sets and verifies their accuracy, compares and analyzes different classification algorithms, and selects the best-performing one. It then uses the constructed calibration data set as the training set to classify the national, provincial, and city industrial data provided by the government departments, and verifies the accuracy of the machine learning model with actual physical inspection feedback and other supplementary experiments. The results show that the naive Bayes classification algorithm performs well. With the F1 value as the evaluation index, the actual feedback test shows that the F1 values of the three data sets increase by 32.92%, 21.42%, and 14.91%, respectively. In the supplementary experiment, compared with the original government department data set, the F1 values increase by 151.06%, 213.45%, and 132.13%, respectively. This more significant improvement verifies the accuracy of the calibration data set and the machine learning model.

Secondly, the Internet multi-source big data acquired by a third-party team is evaluated and filtered according to the general principles of big data availability analysis. The classification model described above is then used to classify the filtered Internet data. Physical inspection feedback and other supplementary experiments are used to verify the accuracy of the results, to analyze the data quality, and to assess the feasibility of using Internet big data to supplement the basic unit list. The final accuracy of the supplementary data is 17.26%. Taking the actual work situation into account, supplementary experimental analysis indicates that the contribution of Internet supplementary data to the basic list of enterprises should be 4.54%-16.85%. Horizontal and vertical comparisons of the classification results show that the Internet data is more homogenized than the departmental data, and that this effect is more pronounced for low-proportion industry categories than for high-proportion ones. In terms of accuracy for specific industry classifications, high-proportion categories have a higher accuracy rate, while low-proportion categories show a larger gap from the departmental data and lower availability. Given the specific objectives of the list screening stage, Internet supplementary data can play an important role in retrieving missing enterprises and can effectively broaden the channels for acquiring data.

Finally, we propose an optimization method for screening basic units in the list screening stage of the pollution source survey. The method uses each enterprise's business scope to correct its industry classification, uses Internet big data to supplement missing enterprise information, and takes into account the requirements of actual pollution source survey work. It provides a reference for the second national pollution source survey and for future environmental statistics work.
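As an illustration only, the classification and evaluation steps described in the abstract can be sketched in a few lines of Python. This is not the thesis's implementation: the multinomial variant with Laplace smoothing, the function names `train_naive_bayes`, `predict`, and `f1_score`, the labels `target` and `non_target`, and the toy business-scope tokens are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count labels and tokens from (tokens, label) pairs (hypothetical format)."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-posterior under Laplace smoothing."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_score(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up training data: business-scope tokens labelled by whether the
# enterprise belongs to a target (industrial pollution source) industry.
train = [
    (["steel", "smelting", "rolling"], "target"),
    (["chemical", "fertilizer", "production"], "target"),
    (["paper", "pulp", "production"], "target"),
    (["software", "consulting", "services"], "non_target"),
    (["retail", "clothing", "sales"], "non_target"),
    (["catering", "services"], "non_target"),
]
model = train_naive_bayes(train)

print(predict(model, ["steel", "production"]))
print(predict(model, ["consulting", "services"]))
```

In practice, business-scope text would first need word segmentation, and the labels would be industry classification codes rather than a binary flag; the F1 helper shows how predictions could be scored against physical inspection feedback for one class.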
Keywords/Search Tags: pollution source survey, list screening, big data, machine learning, optimization, method