The rapid development of the Internet has produced rapid growth in the data accumulated online, and most of these data are tied to spatial locations. In the Web 2.0 era, users participate in content creation and spontaneously generate geospatial content, including their access records on the Internet. A search engine's history contains the search keywords, the search time, and an IP address corresponding to a geographic location, which makes search engine data a typical source of geospatial big data. Such data are open, ubiquitous, and near real-time, so they can help solve problems that traditional data cannot. The application of search engine data to disease surveillance is a classic big data use case: when Google Flu Trends was first released it attracted widespread attention, and many scholars have followed up on it. Previous research has focused mainly on the temporal characteristics of search engine data; few studies examine the spatial distribution of web search behavior, so the spatial dimension of search engine data has not been fully exploited. This paper studies that dimension, and its contents include:

(1) Research on methods for acquiring search engine data. Taking the Baidu Index as an example, this paper introduces a framework for an automated crawler written in Python. The website provides no API for direct data access; the data are presented not as static text but as interactive charts, and the figures are rendered as collages of images, which makes collection difficult. Selenium, a Python package, is used to simulate the user's input, selection, and hovering. After moving the mouse to the position of the index and capturing a screenshot, an image recognition package can be used to read the search index of a keyword.

(2) Research on data preprocessing methods, including keyword selection and the treatment of multicollinearity among keywords' search indices. Keywords drawn from related research and recommended by keyword mining tools form the initial candidate set. Keywords highly correlated with real flu cases are then selected through correlation analysis. Stepwise regression and principal component analysis are applied to address multicollinearity, and the paper discusses when and where each is useful.

(3) Modeling the relationship between the selected keywords' search indices and real flu cases, together with its variation over time and space; the model is used to provide near-real-time estimates of the spatial distribution of influenza. Previous studies have pointed out that web search behavior differs across space, and when researchers study many regions at once they usually model each region separately, commonly with linear regression based on ordinary least squares. Considering the similarity between a spatial unit and its neighbors, this study models multiple study areas simultaneously and accounts for the distance decay effect. Ordinary least squares regression (OLS), geographically weighted regression (GWR), and geographically and temporally weighted regression (GTWR) are fitted, and their fitting and monitoring results are compared. The GTWR model, which accounts for spatial and temporal non-stationarity, performs best. The method can supplement traditional disease surveillance: combining the GTWR model with search engine data can identify high-influenza regions and monitor the spatial distribution of influenza in near real time, and it can also provide predictive models and statistical interpretations for spatial epidemiological studies.
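The crawling workflow in (1) — driving a browser with Selenium, hovering over the chart, screenshotting the chart region, and reading the number with OCR — can be sketched as follows. The landing URL, the CSS selector, and the pytesseract OCR step are illustrative assumptions; the real Baidu Index page must be inspected to fill in the navigation and selectors:

```python
import re


def parse_index(ocr_text: str) -> int:
    """Extract the numeric search index from raw OCR output.

    OCR of a chart screenshot typically yields stray characters around
    the number (e.g. "index: 1,234"), so keep only the digits.
    """
    digits = re.sub(r"\D", "", ocr_text)
    if not digits:
        raise ValueError(f"no number found in OCR text: {ocr_text!r}")
    return int(digits)


def fetch_baidu_index(keyword: str, chart_selector: str) -> int:
    """Simulate a user session and read one keyword's search index.

    `chart_selector` is a hypothetical CSS selector for the chart
    element; the actual selector and the navigation steps depend on
    the live Baidu Index page.
    """
    # Imported lazily so parse_index() works without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.action_chains import ActionChains
    from PIL import Image
    import pytesseract

    driver = webdriver.Chrome()
    try:
        # Navigating to the keyword's trend page is site-specific and elided.
        driver.get("https://index.baidu.com/")
        chart = driver.find_element(By.CSS_SELECTOR, chart_selector)
        # Hover so the image-encoded index value is rendered on the chart.
        ActionChains(driver).move_to_element(chart).perform()
        chart.screenshot("index.png")  # capture only the chart region
        text = pytesseract.image_to_string(Image.open("index.png"))
        return parse_index(text)
    finally:
        driver.quit()
```

Separating the OCR post-processing from the browser automation keeps the fragile, page-dependent part in one function and lets the parsing logic be tested offline.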
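The preprocessing in (2) — correlation-based keyword screening followed by principal component analysis to remove multicollinearity — can be sketched as below. The 0.7 correlation cutoff and the 95% variance threshold are illustrative values, not the thesis's:

```python
import numpy as np


def select_keywords(X, y, names, threshold=0.7):
    """Keep keywords whose search index series has absolute Pearson
    correlation with the flu-case series of at least `threshold`."""
    keep = [j for j in range(X.shape[1])
            if abs(np.corrcoef(X[:, j], y)[0, 1]) >= threshold]
    return X[:, keep], [names[j] for j in keep]


def pca_scores(X, var_kept=0.95):
    """Replace correlated search indices with uncorrelated principal
    components retaining `var_kept` of the total variance, via SVD of
    the standardized data matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)           # variance share per component
    k = int(np.searchsorted(np.cumsum(explained), var_kept)) + 1
    return Z @ Vt[:k].T                       # scores of the first k components
```

The component scores, being mutually uncorrelated, can then be used as regressors in place of the raw, collinear search indices.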
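The core of GTWR in (3) is a weighted least-squares fit at each location and time, with kernel weights that decay with spatiotemporal distance so that nearby, recent observations count more. A minimal sketch assuming a Gaussian product kernel with fixed bandwidths (in practice the bandwidths are chosen by cross-validation, and the thesis's exact kernel specification may differ):

```python
import numpy as np


def gtwr_local_fit(X, y, coords, times, u, t, h_s, h_t):
    """Estimate local regression coefficients at location `u` and time
    `t` by weighted least squares with a Gaussian space-time kernel."""
    # Squared space-time distance, each dimension scaled by its bandwidth.
    d2 = np.sum((coords - u) ** 2, axis=1) / h_s**2 + ((times - t) / h_t) ** 2
    w = np.exp(-0.5 * d2)                        # nearby observations ~ 1
    Xd = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    XtW = Xd.T * w                               # equivalent to Xd.T @ diag(w)
    beta = np.linalg.solve(XtW @ Xd, XtW @ y)
    return beta                                  # local intercept and slopes
```

Calling this at every spatial unit and time step yields the surface of local coefficients; setting `h_t` very large reduces the model to GWR, and uniform weights reduce it to OLS.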