
Data Review Of Pollution Source Census Based On Knowledge Graph And Machine Learning

Posted on: 2021-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Shi
GTID: 2381330605971516
Subject: Chemical engineering
Abstract/Summary:
The second national pollution source census has been carried out since December 31, 2017. Census data serve as the basis for formulating environmental protection policies, environmental protection plans, and data quality standards. The first pollution source census, ten years earlier, was limited by the technology of the time: data review was inadequate and data quality was doubtful, so the census results were underused and a great deal of manpower and material resources was wasted. Applying big data analysis to the review of pollution source census data can overcome these technical limitations, efficiently and accurately flag enterprises whose data quality is in doubt, and narrow the scope of door-to-door verification.

Based on the practical needs of the census process, this thesis first selects a sample industry and uses its data to build two models: a logical relationship audit model based on the knowledge graph and social network analysis, and a numerical data review model based on the isolation forest algorithm (iForest) and the self-organizing map neural network algorithm (SOM). The optimal threshold of the algorithm is found by visualizing the distances between neighboring nodes and by checking the feedback from on-site verification, and the model is optimized accordingly. The model is evaluated by accuracy, precision, recall, and the F1 value (the harmonic mean of precision and recall), and is then applied to the review of pollution source census data. The review process comprises data cleaning, construction of a standard data set, logical relationship review, numerical anomaly review, on-site verification, and feedback of results.

The results show that when the thresholds of the logical relationship audit model are set to 0.5% in the data cleaning stage and 70% in the social network analysis stage, no key inspection indicators are missed and the audit results are optimal. A first audit and a second review of nearly 8,000 enterprises found data problems in 1,539 and 386 enterprises, respectively, reducing the scope of on-site review by 80.74% and 95.16%. When the threshold of the numerical data review model is set to 0.26, the review results are optimal: accuracy, precision, recall, and F1 value reach 94.00%, 95.95%, 95.95%, and 0.9595, respectively. The model audited more than 14,000 enterprises and found data problems in 1,095 of them, reducing the scope of on-site review by 81.92%.

The data review model established in this thesis can accurately identify suspect enterprises in the pollution source census, narrow the scope of on-site verification, and help guarantee data quality. It also offers new ideas for improving data quality in the field of environmental statistics.
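The evaluation metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's code: the function names `review_metrics` and `scope_reduction` are hypothetical, and the confusion-matrix counts in the usage example are assumed values chosen only because they are arithmetically consistent with the reported 94.00% accuracy and 95.95% precision and recall. The abstract does not give the actual counts.

```python
def review_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: problem enterprises correctly flagged
    fp: sound enterprises wrongly flagged
    fn: problem enterprises that were missed
    tn: sound enterprises correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def scope_reduction(flagged, total):
    """Share of enterprises that no longer need on-site review."""
    return 1 - flagged / total


# Assumed counts for illustration only (not taken from the thesis).
metrics = review_metrics(tp=71, fp=3, fn=3, tn=23)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because precision and recall are equal in the reported results, their harmonic mean equals the same value, which is why the F1 value (0.9595) matches the 95.95% figures.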
Keywords/Search Tags: pollution source census, data quality, knowledge graph, isolation forest, self-organizing map