| Chronic obstructive pulmonary disease (COPD) is a common chronic respiratory disease with high morbidity and mortality. COPD seriously affects patients' lives and imposes an economic and social burden. Related studies have shown that COPD is associated with harmful airborne particles (such as PM2.5) and harmful gases (such as SO2, NO2, and CO). To predict the number of newly admitted COPD patients per week, the effect of the average concentrations of PM2.5, SO2, NO2, and CO on the number of newly admitted COPD patients was studied using data mining and machine learning, and a model was then established to predict the weekly number of new COPD patients. Appropriate air pollutants were selected as the predictor variables, and the mean absolute percentage error (MAPE) was used to measure the prediction accuracy of the model. Through exploration and comparison, a K-means-based locally weighted linear regression (LWLR) combination model with high prediction accuracy was constructed. The main research work and achievements are as follows: (1) Several regression analysis methods were studied and analyzed to predict the number of newly admitted COPD patients per week. Regression analysis is a supervised learning method that characterizes the relationship between the predictor variables and the target variable, which is equivalent to a function mapping between variables. Regression analysis has two stages: the first is model training, which is equivalent to function fitting; the second is prediction, in which the trained model is applied to new data. The prediction accuracy of several regression analysis methods on the test set was compared, and the classification and regression tree (CART) algorithm achieved the highest accuracy (an error of 13.36%). (2) A combined model based on LWLR is proposed. LWLR is an instance-based nonparametric learning algorithm with good predictive performance. In many machine learning
methods, the predictive ability of a combined model is often stronger than that of a single model. Two combination models were studied. The first combines LWLR models with different kernel functions and weights the predicted value of each model. The second dynamically encodes different kernel functions to form a new kernel function. Both combination models improve prediction accuracy over a single LWLR model: the prediction error of a single LWLR model is 13.49%, while that of the first combination model is 13.34% and that of the second is 13.38%. (3) A combination model of LWLR based on a binary tree of the training set is proposed. For each prediction, the LWLR model must first traverse all samples of the training set, yet only a small number of samples actually contribute to the regression prediction. As the number of training samples increases, so does the amount of computation. To reduce the computational complexity, a binary tree of the training set was built based on the idea of CART. This method not only greatly reduces the computational burden but also reduces the prediction error of the model (13.13%). (4) A combination model of LWLR based on the K-means clustering algorithm is proposed. The binary-tree-based combination model can lose samples needed by a test point. To solve this problem, the K-means algorithm was used to divide the training set into several subsets, and the required sample set was selected adaptively according to the actual situation. This method greatly improves the calculation speed and the prediction accuracy of the whole model. When the number of clusters is 11, the prediction error of the model on the test set is 12.08%. |
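
The ideas summarized in this abstract (MAPE as the error measure, LWLR as the base predictor, and K-means used to restrict each prediction to a subset of the training set) can be illustrated with a minimal sketch. It is not the authors' implementation. The Gaussian kernel, the bandwidth `tau`, the plain Lloyd-style `kmeans` helper, and the rule of using only the cluster whose centroid is nearest to the query are all assumptions made for illustration, since the abstract does not specify the kernel functions or how the required sample set is adaptively selected.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def lwlr_predict(x_query, X, y, tau=1.0):
    """Locally weighted linear regression for one query point.

    Assumption: Gaussian kernel with bandwidth tau.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xq = np.asarray(x_query, dtype=float)
    # Add an intercept column to the design matrix and the query.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    xqb = np.concatenate([[1.0], xq])
    # Weight each training sample by its distance to the query.
    d2 = np.sum((X - xq) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))
    # Solve the weighted normal equations (pinv guards against singularity).
    XtW = Xb.T * w
    theta = np.linalg.pinv(XtW @ Xb) @ (XtW @ y)
    return float(xqb @ theta)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # Recompute labels so they match the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    return centroids, labels


def kmeans_lwlr_predict(x_query, X, y, centroids, labels, tau=1.0):
    """LWLR restricted to the cluster nearest to the query.

    Assumption: nearest-centroid selection stands in for the adaptive
    sample-set selection mentioned in the abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xq = np.asarray(x_query, dtype=float)
    nearest = int(np.argmin(np.linalg.norm(centroids - xq, axis=1)))
    mask = labels == nearest
    return lwlr_predict(xq, X[mask], y[mask], tau=tau)
```

Under these assumptions, each prediction only visits the samples of one cluster rather than the whole training set, which is the source of the speed-up described in item (4). In practice the number of clusters and the bandwidth would be tuned by comparing MAPE on held-out data.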