Research On Statistical Data Accuracy Test Based On Benford-Boosting Method

Posted on:2022-09-02

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

GTID:2480306476481994

Subject:Master of Applied Statistics

Abstract/Summary:

With the development of data collection and storage technology, the amount of data is increasing rapidly, and a large volume of abnormal data inevitably comes with it. In the era of big data, the accuracy of data seriously restricts the quality of decision-making, and the accuracy of statistical data has become a key issue of concern. A review of the literature finds that Benford's law, which has been applied for decades, has certain limitations. On the basis of reviewing domestic and foreign methods for testing data quality and accuracy, as well as big data algorithms, this paper summarizes the advantages and limitations of Benford's law, combines the Boosting algorithm with Benford's law, and proposes a combined algorithm for testing the accuracy of statistical data. This addresses the problem that the traditional Benford's law can only locate problems at the first digit of the data, and it improves the method of checking the accuracy of statistical data. The research content, conclusions and innovations are mainly as follows.

(1) A study of the literature on Benford's law and its applications finds that using Benford's law may raise issues such as mismatched requirements on data size, the ability to screen out only the range of abnormal data, sensitivity to changes in the data set, and difficulty adapting to new-era requirements such as identifying the specific timing of abnormal data and overall regularity. Benford's law therefore needs to be improved further before it can be applied to testing the accuracy of statistical data.

(2) To overcome the limitations of Benford's law, a Benford-Boosting algorithm model for testing the accuracy of statistical data is proposed. First, the data is organized into panel data containing time and region. Second, problem indicators are screened out by comparing correlation coefficients, and an abnormal data pool is established. Third, the abnormal data pool is narrowed by gradual elimination to find the time points of the problem indicators, and the distance ratio is used to locate the specific regions of the problem indicators; the abnormal data points are then filtered out, and all other data points are set as normal data points. Fourth, the random forest algorithm is used to select the indicators related to the problem indicator, forming a data set of important indicators that contains both normal and abnormal data points. Finally, the Boosting method is used for classification learning for data quality inspection.

(3) This paper selects data on 13 financial indicators (industrial sales output value, total assets, total current assets, total liabilities, total current liabilities, total owner's equity, main business income, main business costs, sales expenses, management expenses, financial expenses, operating profits, and total profits) and studies them empirically using the Benford-Boosting algorithm. First, the data is organized into "characteristic index time dimension" panel data, and the correlation coefficient method indicates that the operating profit indicator may be a problem indicator. Second, the gradual elimination method and distance ratio positioning are used to screen abnormal data points. The study finds that, for 2012-2017, the regions where the first digit of the annual operating profit indicator is 2, 3, 4, or 6 have problems. These data are judged to be abnormal data points and marked as 1, and the data of other periods and regions are set as normal data points and marked as -1, forming data set (1), which contains the 13 financial indicators. Third, the random forest algorithm is used to rank the importance of the indicators. In order, they are total profit, operating profit, main business income, main business cost, industrial sales output value, total liabilities, financial expenses, total current assets, and sales expenses, which form data set (2), containing these 9 financial indicators. Fourth, the Benford-Boosting model constructed in this paper and the Benford-decision tree model commonly used in the literature are applied to data set (2) and data set (1) for classification learning. The accuracies of the models are 93.8% and 87.5%, and 79.1% and 60.4%, respectively, which verifies the effectiveness of the Benford-Boosting model.

In summary, the Benford-Boosting model constructed in this paper is innovative to a degree, and has strong theoretical significance and application value for improving methods of inspecting the quality of statistical data.
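
The classical first-digit check that the abstract takes as its starting point can be illustrated with a minimal Python sketch. It compares observed leading digits against Benford's expected frequencies, P(d) = log10(1 + 1/d). The function names and the use of a chi-square statistic are assumptions made here for illustration; the abstract does not state which goodness-of-fit measure the thesis uses, and this sketch does not reproduce the Benford-Boosting procedure itself.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_expected():
    """Benford's first-digit probabilities P(d) = log10(1 + 1/d), d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading nonzero decimal digit of a nonzero number (hypothetical helper)."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # Decimal stores the coefficient without leading zeros,
    # so its first digit is the leading significant digit.
    return Decimal(str(abs(x))).as_tuple().digits[0]


def benford_chi_square(values):
    """Chi-square statistic of observed first digits against Benford's law.

    The chi-square choice is an illustrative assumption, not the thesis's method.
    Zeros are skipped because they have no leading digit.
    """
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    stat = sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )
    return stat, counts


# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_5PCT_DF8 = 15.507

# Powers of 2 are a standard example of a Benford-like sequence.
benford_like, _ = benford_chi_square([2 ** k for k in range(1, 301)])
# Three-digit integers have uniformly distributed first digits.
uniform_like, _ = benford_chi_square(range(100, 1000))

print("powers of 2 conform:", benford_like < CRITICAL_5PCT_DF8)
print("uniform digits conform:", uniform_like < CRITICAL_5PCT_DF8)
```

A check of this kind flags only that a data set as a whole departs from the expected digit distribution, which is the limitation the abstract describes; it does not say which period or region produced the abnormal values.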

Keywords/Search Tags:

Benford's Rule, Boosting Classification Algorithm, Data Accuracy Test

Related items

1	Gene Expression Data Classification Based On Boosting
2	Classification Of Hyperspectral Data Based On Multi-feature Combination By Multiple Kernel Boosting
3	Classification Methods In Data Mining And Their Applications To Mass Spectral Data
4	Research On Classification Algorithm Based On Fuzzy Rules
5	A New Algorithm Designed For Weighted Samples Classification And Some New Boosting Algorithms Designed For Classification Based On Additive Logistic Regression Model
6	Boosting Algorithm And Its Application
7	Determination of classification accuracy for land use/cover types using LANDSAT-TM, SPOT-MSS and multipolarized and multi-channel synthetic aperture radar (SAR) data
8	Research On Accuracy Analysis And Results Correction Of Land Type Interpretation In Moderate-resolution Remote Sensing Image Based On Secondary Survey Data
9	Research On AQI Prediction Model Of Hefei City Based On Boosting Algorithm
10	Research On Data Classification And Application Based On Fuzzy Knowledge