| Wage income has become the main source of residents’ income. As employment and income sources diversify, residents’ wage income becomes harder to survey. In practice, household surveys suffer from problems such as careless bookkeeping by surveyed households and a weak sense of responsibility. The accuracy of wage income survey data is therefore difficult to guarantee, which distorts the assessment of a region’s per capita disposable income, especially in surveys of rural residents conducted for targeted poverty alleviation. Today, daily life is inseparable from the Internet, which generates a large volume of electronic data. Web crawlers are an important means of collecting these data, and current research focuses on how to make crawlers better meet personalized needs, how to make them more convenient to use, and how to apply them to practical problems in various industries. To address the problems above, this paper proposes a web-crawler-based algorithm for judging the accuracy of household wage survey data. First, the paper studies techniques for acquiring network data with web crawlers: it introduces the concept and research background of web crawlers and the general method of acquiring network data with a Python web crawler, focusing on the use of the re library for information extraction. However, current crawling techniques still have shortcomings: dynamic web pages are difficult to parse, crawling is slow, and the crawled content is inaccurate. To solve these problems, this paper proposes a multi-threaded network data acquisition algorithm based on selenium. The algorithm uses the Python selenium library to run and operate a browser 
automatically, which allows data to be obtained from both dynamic and static pages. A headless browser, multi-threaded crawling, and a keyword discrimination program improve the crawling speed and the accuracy of the captured content. Based on this data collection algorithm, the paper proposes two methods for auditing household wage survey data: most-value discrimination and "3σ" rule discrimination. Most-value discrimination first obtains, from a talent market website, the wage information matching the household (given as ranges), and then finds the mode of the wage ranges, expressed as (p, q). After the ranges are subdivided and recounted, the maximum value of the salary data obtained from the talent market website is derived and taken as the criterion for the accuracy of the household salary survey data. The "3σ" rule discrimination method distinguishes two cases. For residents whose company and position have matching recruitment information on the regional talent market website, the salary information for the target company and position is obtained directly with the web crawler and used as the accuracy criterion. For households without matching company and position recruitment information, an optimal decision tree algorithm is first used to classify the companies offering the target position on the target talent market website, and the EM algorithm is then used to fill in the missing values for the missing companies using the category wage mean and variance. Then, assuming that the salary data for the target position in each category follow a normal distribution, the range for judging accuracy is obtained from the "3σ" criterion. The paper selects 4 administrative villages in a city, surveys 56 poor households by sampling, and analyzes the survey results. The two audit methods above are applied to the actual 
survey of poor households. Comparative analysis shows that the "3σ" rule discrimination method performs better than the most-value discrimination method, which demonstrates that the method is highly practical. |
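
The "3σ" criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the sample posting text, the `low-high` range format, the use of range midpoints as salary values, and the function names `parse_ranges`, `three_sigma_interval`, and `is_plausible` are all assumptions made for illustration. The sketch extracts salary ranges with the `re` library, computes the mean and standard deviation of their midpoints, and treats a reported wage as plausible if it lies within three standard deviations of the mean.

```python
import re
import statistics

# Hypothetical format: salary ranges written as "low-high", e.g. "3000-5000".
RANGE_RE = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_ranges(text):
    """Extract (low, high) salary ranges from raw posting text."""
    return [(int(lo), int(hi)) for lo, hi in RANGE_RE.findall(text)]


def three_sigma_interval(ranges):
    """Return (mean - 3*sd, mean + 3*sd) computed over the range midpoints."""
    # Assumption: each posting is summarized by the midpoint of its range.
    mids = [(lo + hi) / 2 for lo, hi in ranges]
    mean = statistics.mean(mids)
    sd = statistics.pstdev(mids)  # population standard deviation
    return mean - 3 * sd, mean + 3 * sd


def is_plausible(reported_wage, ranges):
    """Check whether a surveyed wage falls inside the 3-sigma interval."""
    lower, upper = three_sigma_interval(ranges)
    return lower <= reported_wage <= upper


# Made-up posting text standing in for crawled talent market data.
postings = "Welder 3000-5000; Welder 3500-4500; Welder 4000-6000"
ranges = parse_ranges(postings)
print(three_sigma_interval(ranges))
print(is_plausible(4200, ranges), is_plausible(9000, ranges))
```

With real data, the interval would be computed per company category for the target position, under the normality assumption stated in the abstract; the sketch uses a single category for brevity.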