| Objective Severe fever with thrombocytopenia syndrome (SFTS) is an emerging infectious disease caused by severe fever with thrombocytopenia syndrome virus (SFTSV), and China is the main epidemic area of the disease. Baidu search index (BSI) data and meteorological factors can assist in disease research. This study aimed to establish a machine learning prediction model of SFTS incidence in Hubei Province that incorporates BSI and meteorological factors, and to provide a basis for formulating control and intervention measures. Methods SFTS case data, meteorological data and BSI data for Hubei Province from January 2013 to November 2020 were collected from the China Public Health Science Data Center, the China Meteorological Data Network and the official Baidu Index website, respectively. According to the type and number of variables, the data were divided into four data sets: a time trend data set, a meteorological factor data set, a BSI factor data set and a comprehensive factor data set. Using data from January 2013 to December 2019, prediction models based on ordinary least squares (OLS), the autoregressive integrated moving average (ARIMA) model and multiple machine learning algorithms were established. SFTS case data from January 2020 to November 2020 were used for model evaluation. For the machine learning models, the hyperparameters were optimized with a genetic algorithm, particle swarm optimization and Bayesian optimization. R², mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE) and symmetric mean absolute percentage error (SMAPE) were used to evaluate the models, and the number of cases predicted by the optimal model on each of the four data sets was compared with the actual number of cases. The overall optimal model was then selected and interpreted with SHAP. Results The R², MAE, MSE, RMSE and SMAPE of the OLS model on the meteorological factor data set, the BSI factor data set and the comprehensive factor data set were lower than those of the ARIMA model on the time trend data set. On
the other hand, the ARIMA model outperformed the machine learning models overall on the meteorological factor data set but underperformed them on the comprehensive factor data set. On the BSI factor data set, about half of the machine learning models outperformed the ARIMA model and about half underperformed it. The optimal prediction model was GA-GBDT, a gradient boosting decision tree (GBDT) model whose hyperparameters were tuned with a genetic algorithm (GA), trained on the comprehensive factor data set. Its R², MAE, MSE, RMSE and SMAPE were 0.977, 2.210, 8.637, 2.939 and 50.093, respectively. SHAP feature importance ranked the contributions of the factors to the model, from high to low, as keyword K3, keyword K1, precipitation, air temperature, humidity and keyword K2. The contribution of the BSI keywords to the model was greater than that of the meteorological factors. In addition, most machine learning models performed better than the OLS and ARIMA models on the same data set. Model performance on the four data sets ranked, from high to low, as comprehensive factor data set, BSI factor data set, time trend data set and meteorological factor data set. Conclusion The BSI factor prediction model built with machine learning is better than the meteorological factor prediction model built with the same method. However, combining the two sets of factors yields a still better prediction model when machine learning is used. The overall performance of the machine learning models is better than that of the OLS and ARIMA models. Accordingly, this study provides a new tool for predicting SFTS epidemic trends and for epidemic early warning in Hubei Province. |
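
The five evaluation metrics named in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function name `regression_metrics` is hypothetical, the sample counts are made up for demonstration, and the SMAPE variant shown (mean of |a - p| / ((|a| + |p|) / 2), expressed as a percentage) is an assumption, since the abstract does not state which definition was used.

```python
import math


def regression_metrics(actual, predicted):
    """Return R^2, MAE, MSE, RMSE and SMAPE (percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R^2 = 1 - SS_res / SS_tot; undefined when the actual values are constant.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # SMAPE (assumed variant): a term counts as 0 when both values are 0.
    terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        terms.append(0.0 if denom == 0 else abs(a - p) / denom)
    smape = 100.0 * sum(terms) / n

    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse, "SMAPE": smape}


if __name__ == "__main__":
    # Made-up monthly case counts, for illustration only (not study data).
    actual = [2, 10, 30, 20, 5]
    predicted = [3, 12, 27, 21, 4]
    for name, value in regression_metrics(actual, predicted).items():
        print(f"{name}: {value:.3f}")
```

Because RMSE is the square root of MSE, the reported values can be cross-checked: the square root of 8.637 is approximately 2.939, which matches the RMSE given for the GA-GBDT model.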