Travel is an unavoidable part of daily life, and as air travel becomes more widely used, the financial losses caused by flight delays have drawn growing attention. According to statistics published by the Federal Aviation Administration, flight delays cost the United States $28 billion in 2018, a heavy loss for both airlines and passengers. With rapid economic growth, demand in China's civil aviation market is also rising: Chinese airports handled 857 million passengers and completed 9.0492 million aircraft movements in 2020, a level comparable to that of the United States in the same year, and China now faces the same problem of economic losses due to flight delays. This paper analyzes flight delay data from the United States, builds flight delay prediction models using several methods, and selects the best model by comparing them. Because flight delays in different countries share similar characteristics, the methods used here can later be extended to the analysis of flight delays in China, helping to avoid risks and reduce unnecessary losses through technical means.

The data come from the Bureau of Transportation Statistics and cover flights of major American airlines in 2019. The data are first preprocessed by filling in missing values and creating new variables. The correlations between the main variables and flight delays are then analyzed and visualized. On the processed data set, flights are classified using eight methods: Logistic Regression, CART-based Decision Tree, C4.5-based Decision Tree, Random Forest, XGBoost, Logistic Regression using only the new features generated by XGBoost, Logistic Regression using the new features generated by XGBoost combined with the original features, and Support Vector Machine with a Gaussian kernel function. The classification performance of each model is evaluated by jointly considering Accuracy, Precision, Recall, F1 score and AUC score.

Among the eight methods, Support Vector Machine with a Gaussian kernel function has the best overall performance and is effective in identifying flight delays. AUC score is the main criterion for selecting the best method. Because misclassifying a delayed flight as a punctual one causes great losses to airlines, special attention is also paid to Recall. Logistic Regression achieves the highest Recall, but it performs worse than Support Vector Machine with a Gaussian kernel function in AUC score, Accuracy, Precision and F1 score. Random Forest and XGBoost combine multiple decision trees and provide better classification performance than the CART-based and C4.5-based Decision Trees. The two methods that combine XGBoost with Logistic Regression can capture non-linear relationships between variables and draw on the ideas of both models, but they do not achieve the anticipated results. This suggests that a method that performs well elsewhere may not suit every data set and that methods need to be assessed case by case. In the study of flight delay classification, several methods should therefore be used together to analyze the key factors affecting flight delays and to make predictions. In practice, it is also important to prepare actively for flight delays and to take effective measures to minimize their adverse effects.
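To make the evaluation criteria concrete, the following minimal sketch computes Accuracy, Precision, Recall, F1 score and AUC score for a binary delay classifier using only the Python standard library. It is an illustration under stated assumptions, not the code used in this paper: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical, and label 1 is assumed to mean "delayed".

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives (1 = delayed, 0 = punctual)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1  # delayed flight predicted as punctual
    return counts


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def threshold_metrics(y_true: Sequence[int], y_score: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, Precision, Recall and F1 at a given score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    c = confusion_counts(y_true, y_pred)
    precision = _ratio(c["tp"], c["tp"] + c["fp"])
    recall = _ratio(c["tp"], c["tp"] + c["fn"])
    return {
        "accuracy": _ratio(c["tp"] + c["tn"], len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def auc_score(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (delayed, punctual) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in
    the sample size and is meant for illustration only.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return _ratio(wins, len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted delay probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]

default_metrics = threshold_metrics(y_true, y_score)          # threshold 0.5
recall_focused = threshold_metrics(y_true, y_score, 0.35)     # lower threshold
auc = auc_score(y_true, y_score)
```

Lowering the threshold in this toy example raises Recall at the cost of more false alarms, which reflects the trade-off described above: AUC summarizes ranking quality across all thresholds, while Recall measures how many delayed flights are caught at one chosen threshold.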