Analysis Of Factors Contributing To Imbalanced Crash Severity Based On CTGAN And Model Interpretability

Posted on: 2024-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: J L Li
GTID: 2542307157470944
Subject: Transportation
Abstract/Summary:
Class imbalance refers to a data set in which one class vastly outnumbers the others, so that the sample proportions are severely skewed. Traffic accident data are a typical example: property-damage-only accidents generally account for the largest share, followed by minor-injury accidents, while serious-injury and fatal accidents are usually the rarest. Traditional classification models are better suited to balanced data and give only sub-optimal results on imbalanced data; that is, they predict majority-class accidents with higher accuracy but minority-class accidents with lower accuracy. Yet it is precisely the minority-class crashes, those resulting in serious injuries or fatalities, that decision makers care about most. Addressing the extreme imbalance of traffic accident severity is therefore of great theoretical and practical significance, both for improving model prediction accuracy and for exploring in depth the influence mechanism of each characteristic factor.

First, this paper takes two-vehicle rear-end accidents as the research object and extracts a total of 164,361 accident records from the Chicago Data Portal covering 2016 to 2020. After merging, cleaning, and label-encoding the data, Multiple Imputation by Chained Equations (MICE) is used to fill in missing values, finally yielding 70,668 complete records of two-vehicle rear-end collisions. A total of 18 accident characteristics are then selected as factor variables from four aspects (driver features, vehicle features, road features, and environmental features), and 75 binary independent variables are obtained after one-hot encoding.

Second, to address the highly imbalanced distribution of accident severity, this paper introduces Conditional Tabular Generative Adversarial Networks (CTGAN) to the traffic safety field and uses it to balance the three severity classes, so that the ratio of property-damage-only, minor-injury, and serious-injury/fatal accidents becomes 1:1:1. To compare the effect of different data balancing methods, SMOTE (Synthetic Minority Over-Sampling Technique), ADASYN (Adaptive Synthetic Sampling), and KMeans SMOTE (K-Means + SMOTE) are selected from the oversampling methods, and SMOTEENN (SMOTE + Edited Nearest Neighbours) is selected from the combined sampling methods. The quality of the synthetic data produced by each algorithm is evaluated using SDMetrics (the Synthetic Data Metrics library), and the results show that the synthetic data from CTGAN are more consistent with the real data distribution, with a quality score of 0.9387 for the synthetic serious-injury/fatal data and 0.9480 for the overall synthetic data.

Then, because prediction accuracy for minority-class data is low, this paper selects two ensemble learning models that are suitable for imbalanced data: random forest and XGBoost. Ten prediction models are constructed, one for each combination of the two classifiers with the balanced data produced by CTGAN, SMOTE, ADASYN, KMeans SMOTE, and SMOTEENN. Accuracy, F1-score, G-mean, and AUC are selected as the evaluation indicators. The results verify that both the random forest model and the XGBoost model based on CTGAN improve the prediction accuracy for minority-class data, and that XGBoost predicts better than random forest. The CTGAN + XGBoost model has the highest prediction accuracy for serious-injury and fatal accidents and the strongest overall classification performance, with a prediction accuracy of 0.7746 and an AUC of 0.6367.

Finally, the optimal model, CTGAN + XGBoost, is analyzed visually using SHAP values, and the influence mechanism of each characteristic factor on accident severity is discussed in depth. The influence of each factor is analyzed from four perspectives: the overall accident sample, the property-damage-only sample, the minor-injury sample, and the serious-injury/fatal sample. The analysis shows that the pre-crash behavior of the following vehicle's driver, vehicle type, traffic control devices, and age are the key factors affecting accident severity; that men are more likely than women to be involved in traffic accidents, whether driving the leading or the following vehicle; and that road dividers can effectively prevent fatalities and reduce accident risk. Based on the visual analysis, targeted improvement measures are proposed from four aspects (people, vehicles, roads, and the environment) to reduce accident risk and improve driving safety in practice.

This paper is supported by the National Natural Science Foundation of China (NSFC), Grant No. 52102404, "Research on the Mechanism of Traffic Accident Severity Considering Data Imbalance and Model Interpretability".
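
The abstract names G-mean alongside accuracy as an evaluation indicator but does not define it. The sketch below shows why such an indicator matters on imbalanced severity data. It assumes the common multiclass definition (the geometric mean of per-class recalls), which may differ from the definition used in the thesis. The labels, class sizes, and predictions are hypothetical toy values, not data or results from the thesis.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition)."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels:
# 0 = property damage only, 1 = minor injury, 2 = serious injury or fatal.
y_true = [0] * 90 + [1] * 8 + [2] * 2

# Classifier A always predicts the majority class.
pred_a = [0] * 100

# Classifier B trades some majority-class accuracy for minority-class recall.
pred_b = (
    [0] * 72 + [1] * 18  # 90 true class-0 samples: 72 correct
    + [1] * 6 + [0] * 2  # 8 true class-1 samples: 6 correct
    + [2] * 1 + [0] * 1  # 2 true class-2 samples: 1 correct
)

print(accuracy(y_true, pred_a), g_mean(y_true, pred_a))  # 0.9 accuracy, 0.0 G-mean
print(accuracy(y_true, pred_b), g_mean(y_true, pred_b))  # 0.79 accuracy, G-mean about 0.669
```

Classifier A scores higher on accuracy while never detecting an injury or fatal crash. G-mean drops to zero as soon as any class has zero recall, which is why it is a useful companion to accuracy in this setting.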
Keywords/Search Tags: Traffic accident severity, Class imbalance, Conditional Tabular Generative Adversarial Network (CTGAN), Extreme Gradient Boosting (XGBoost), SHAP