Research On Prediction Algorithm Of Thrombosis Risk Based On Imbalanced Data

Posted on: 2021-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: J H Huang
GTID: 2404330614460769
Subject: Computer application technology
Abstract/Summary:
Venous thromboembolism (VTE) after orthopedic surgery is one of the main causes of death during surgery. Patients generally have no clinical manifestations during the onset of the disease, and thrombosis causes death by blocking the arteries of the organs. Almost 25% of VTE patients in the United States are admitted to the hospital, and 10% of hospitalized deaths are related to pulmonary embolism. From 2007 to 2016, the incidence of VTE in China also increased from 3.2 per 100,000 to 17.5 per 100,000, which makes predicting the risk of thrombosis after orthopedic surgery an important topic in clinical research. However, the proportion of patients who develop thrombosis is extremely low, so the data are severely imbalanced.
In many practical machine learning applications, such as financial fraud detection, fault detection, and spam filtering, the data sets are imbalanced, the minority class is often the more important one, and misclassifying it carries a higher cost. Traditional classification algorithms, which take the prediction accuracy over all data as the learning objective, are therefore poorly suited to imbalanced data, and improving the classification of imbalanced data is of considerable research significance. Current work on imbalanced data classification takes two main approaches: data resampling and the improvement of classification algorithms.
Against this background, this thesis addresses the classification of imbalanced data on patients after orthopedic surgery by preprocessing a real data set, improving a resampling algorithm, and combining a cost-sensitive function with an ensemble learning algorithm. The main work of this thesis is as follows:
(1) The data studied in this thesis come from the Department of Orthopaedics of the General Hospital of the Chinese People's Liberation Army (Hospital 301) and are real clinical data. When patient data are entered at the hospital, erroneous and missing records are unavoidable. Data preprocessing is an important part of machine learning, and properly preprocessing a data set can improve classifier performance. The problems with the original data are incomplete data, inconsistent data, redundant data, and a lack of numerical features. A set of data-processing rules was drawn up under the guidance of doctors. After preprocessing, the data set contains 15,856 patients, including 15,328 patients without thrombosis and 528 patients with thrombosis.
(2) This thesis proposes the iF-ADASYN sampling algorithm, which uses the ADASYN sampling algorithm as its baseline and introduces the isolation forest algorithm to overcome ADASYN's susceptibility to outliers. The iF-ADASYN algorithm computes the weight of each minority-class sample, checks whether the samples with higher weights are outliers, deletes those outliers, and then resamples the remaining minority data. The experimental results show that, on the orthopedic surgery patient data set, iF-ADASYN improves the AUC value compared with the commonly used sampling algorithms SMOTE and ADASYN, and the recognition rate for patients with thrombosis increases by 20%. Compared with ADASYN, iF-ADASYN is more resistant to interference from outliers and improves the accuracy of the minority-class decision boundary.
(3) This thesis proposes CO-GBDT, a gradient boosting tree algorithm based on cost-sensitive learning. The algorithm introduces a cost function into the log loss function of GBDT. For binary classification problems, it raises the cost of misclassifying minority-class samples as majority-class samples, which biases CO-GBDT toward the minority class. Three different cost ratios are used, and both the original data and the data resampled with the iF-ADASYN algorithm from the previous chapter serve as training sets. Comparing the results of CO-GBDT on the two kinds of training data shows that the algorithm works better on the imbalanced raw data, where its recognition rate for the minority class can reach 95%.
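The abstract does not give the exact form of the cost function used in CO-GBDT. As a minimal sketch of the general idea in item (3), the code below assumes one common formulation: the minority-class (positive) term of the binary log loss is multiplied by a cost ratio. The names `cost_ratio`, `cost_weighted_log_loss`, and `gradient_and_hessian` are hypothetical and not taken from the thesis; the gradient and Hessian are taken with respect to the raw score, which is what a gradient boosting tree fits at each round.

```python
import math


def sigmoid(score):
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def cost_weighted_log_loss(y, score, cost_ratio):
    """Log loss with the positive (minority) term scaled by cost_ratio.

    y is 0 or 1, score is the raw model output before the sigmoid.
    cost_ratio is an assumed hyperparameter; cost_ratio == 1 gives
    the ordinary log loss.
    """
    eps = 1e-12
    p = min(max(sigmoid(score), eps), 1.0 - eps)
    return -(cost_ratio * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def gradient_and_hessian(y, score, cost_ratio):
    """First and second derivatives of the loss w.r.t. the raw score."""
    p = sigmoid(score)
    grad = -cost_ratio * y * (1.0 - p) + (1 - y) * p
    hess = (cost_ratio * y + (1 - y)) * p * (1.0 - p)
    return grad, hess
```

Under this assumed formulation with `cost_ratio = 5`, a thrombosis case at raw score 0 contributes a gradient five times larger in magnitude than it would under the ordinary log loss, so each boosting round corrects missed minority cases more aggressively.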
Keywords/Search Tags: Imbalanced data classification, venous thromboembolism, sampling algorithm, ensemble learning, cost-sensitive