| As urbanization becomes a global trend, vehicle ownership is rising, and with it the number of traffic accidents and casualties, which brings great loss and distress to society and individuals. At the same time, urbanization and growing traffic congestion have promoted the development of connected vehicle technology. However, when expanding their auto insurance business, companies often focus on improving the profit model and meeting sales targets, and neglect the identification of high-risk drivers and the control of claims costs. Therefore, whether from the perspective of public safety or from that of auto insurance operations and connected vehicle development, driver risk identification is an important research direction. In addition, in the era of big data, massive user data has become increasingly easy to obtain, machine learning algorithms have matured, and the technology for customer risk identification has become more advanced. Nevertheless, studies on driver risk identification and control remain few and are mostly based on theoretical analysis, leaving much room for research that uses data to classify and predict driver risk quantitatively. This thesis therefore classifies and predicts driver risk based on four groups of factors: the personal characteristics of the driver involved in the accident, vehicle characteristics, road characteristics, and the driving environment. First, features that may affect driver risk are selected according to the literature and experience, missing values and outliers are handled, and variable values are encoded and standardized. Next, a random forest model is used to rank the importance of the selected features, and features with importance greater than 0.01 are kept as model inputs. The class imbalance in the data is then addressed with three resampling methods: SMOTE oversampling, NearMiss undersampling, and SMOTETomek hybrid sampling. The original imbalanced dataset and the three balanced datasets are used to train a CART decision tree, an AdaBoost ensemble model with CART as the base classifier, and an ANN deep learning model; in addition, the EasyEnsemble imbalanced classification model, which is based on AdaBoost ensemble learning, is trained on the imbalanced dataset. All models are evaluated on a test set. Comparing the models shows that training directly on the original imbalanced dataset yields extremely poor classification, similar to random classification, whereas the three resampling methods and the EasyEnsemble model all alleviate, to some extent, the inability to make effective predictions on imbalanced data: the G_means, F_measure, and AUC values of the models improve markedly, and high-risk drivers can be identified effectively. Overall, SMOTETomek-AdaBoost gives the best classification performance, with an AUC of up to 0.94. This thesis recommends applying this model to the identification and classification of high-risk drivers. |
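
The abstract names G_means, F_measure, and AUC as evaluation metrics but does not define them. The sketch below assumes their standard textbook definitions, which may differ in detail from those used in the thesis: G-mean as the geometric mean of sensitivity and specificity, F-measure as the weighted harmonic mean of precision and recall, and AUC as the probability that a randomly chosen high-risk driver receives a higher score than a randomly chosen low-risk one. The function names and the confusion-matrix counts are hypothetical and serve only to illustrate why these metrics are more informative than accuracy on imbalanced data.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (assumed standard definition)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(tp, fn, fp, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in
    the sample size and is meant for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 50 high-risk (positive) and 950 low-risk drivers.
# A model that always predicts "low-risk" is 95% accurate but finds no
# high-risk driver, so both G-mean and F-measure collapse to zero.
majority = dict(tp=0, fn=50, tn=950, fp=0)
majority_accuracy = (majority["tp"] + majority["tn"]) / 1000
majority_g = g_mean(**majority)
majority_f = f_measure(majority["tp"], majority["fn"], majority["fp"])

# Hypothetical model trained on rebalanced data: lower accuracy, but it
# recovers most high-risk drivers, which G-mean and F-measure reward.
balanced = dict(tp=40, fn=10, tn=855, fp=95)
balanced_accuracy = (balanced["tp"] + balanced["tn"]) / 1000
balanced_g = g_mean(**balanced)
balanced_f = f_measure(balanced["tp"], balanced["fn"], balanced["fp"])

# Tiny hypothetical score list for the pairwise AUC.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In this illustration the always-"low-risk" model has the higher accuracy yet scores zero on both G-mean and F-measure, which is consistent with the abstract's observation that models trained directly on the imbalanced data classify no better than chance.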