
Research On Cardiovascular Disease Diagnosis Model Key Technologies Based On Machine Learning

Posted on: 2024-02-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S S Zhang
GTID: 1524306944470244
Subject: Software engineering
Abstract/Summary:
With rapid economic development, the morbidity and mortality of coronary artery disease (CAD) have been rising year by year, and the disease is affecting younger patients and becoming more complex. CAD imposes an enormous economic burden on both patients and the country. Medical research and clinical practice have shown that early detection and early diagnosis are the most effective means of reducing the mortality and disability rates of CAD. In recent years, machine learning and artificial intelligence technologies have been widely applied to the early, non-invasive, and accurate diagnosis of CAD.

A review of existing research reveals four main problems in machine-learning-based CAD diagnosis. First, in feature selection: how to identify and select a reduced feature subset relevant to CAD diagnosis from redundant feature attributes. Second, in imbalanced data: how to effectively handle the imbalance present in the sample data. Third, in model construction: how to build a better-performing CAD diagnosis model based on ensemble learning algorithms. Fourth, in model evaluation: how to analyze the interpretability of the constructed model.

To address these problems, and based on multi-modal CAD data, this dissertation proposes an ensemble feature selection algorithm for medical data based on multiple evaluation criteria, a hybrid sampling method based on clustering and feature distribution, and a CAD diagnosis model based on ensemble learning and multi-modal data, and it analyzes the interpretability of the model. The main contributions are as follows.

First, an ensemble feature selection algorithm for medical data based on multiple evaluation criteria is proposed. The algorithm combines the feature selection strengths of the Boruta, GBDT, and AdaBoost algorithms. To begin, a feature rank sequence or feature importance sequence is obtained from each of the three algorithms according to its own screening mechanism and evaluation criteria, and the features considered unimportant in the three algorithms are deleted according to preset feature exclusion criteria. Next, the Borda Count voting method, applied to the feature sequences, integrates the remaining features of the three sequences into a new feature sequence. Finally, a reduced feature subset relevant to CAD diagnosis is obtained from the new feature sequence in combination with subset division and a classification algorithm. Controlled experiments on two real CAD data sets verify the effectiveness, feasibility, and superiority of the proposed algorithm.

Second, a hybrid sampling method based on clustering and feature distribution is proposed. The method first uses multiple clustering algorithms to discover potential intra-class imbalance in the data, and applies statistical tests to the differences in feature distribution among clusters in order to evaluate the effectiveness of each clustering algorithm. The clusters with the largest difference in feature distribution are selected for subsequent sampling. The SMOTE oversampling algorithm is then applied at both the intra-class and inter-class levels. Finally, the Tomek Links undersampling algorithm cleans noisy samples and samples that overlap class boundaries in the resampled data set. The method thus addresses data imbalance at the intra-class and inter-class levels simultaneously, and it removes noisy and boundary-overlapping samples whether they exist in the original data set or are introduced by resampling. In addition, resampling guided by differences in feature distribution helps to explore more of the potential intra-class spatial distribution in the data set, which improves the generalization ability and practical value of the model. Controlled and ablation experiments on 7 public real coronary heart disease data sets with different imbalance ratios verify the effectiveness, feasibility, and superiority of the proposed method.

Third, a CAD diagnosis model based on ensemble learning and multi-modal data is constructed, and the SHAP interpretability analysis method is used to analyze the model in depth. Multi-modal CAD data sets from multiple sources are combined, and ensemble learning algorithms are applied to design a CAD diagnosis model with a three-layer classifier structure. SHAP and medical domain knowledge are used to analyze the interpretability of the model, and a method for decomposing and calculating feature contributions in the ensemble model is proposed.

More specifically, in the data preprocessing stage, missing values are imputed with the MICEForest multiple imputation model, and statistical tests are used to compare the distribution of feature attributes before and after imputation, in order to guide the choice of imputation method, the selection of imputed values, and the validation of the imputation results. In the feature engineering stage, the proposed ensemble feature selection algorithm and the proposed hybrid sampling method are each applied to the data set, yielding two data sets that differ greatly in the number of features and the number of samples. This increases diversity at the data set level and helps improve the ensemble effect. In the model construction stage, the first layer trains classifiers with 10 classification algorithms separately on the two data sets. The second layer applies a voting ensemble learning algorithm to combine the classifiers trained in the first layer. The third layer uses the Stacking ensemble learning algorithm to combine the voting ensemble classifiers trained in the second layer. The Bayesian optimization (BO) algorithm is applied to tune the model parameters and to assign the weights of the weighted voting ensemble. In the model evaluation stage, 8 metrics are used to evaluate the performance of the proposed CAD diagnosis model comprehensively, and a comparison with the latest published results in the field demonstrates the excellent performance of the proposed model.
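
As an illustration of the rank-fusion step in the ensemble feature selection described in the abstract, the following minimal sketch fuses three best-first feature rankings with a Borda Count. The function name `borda_count`, the feature names, and the three toy rankings are hypothetical and are not taken from the dissertation; the exclusion criteria and the subset search are omitted, and the exact scoring variant the author used is not stated in the abstract.

```python
from collections import defaultdict


def borda_count(rankings):
    """Fuse several best-first feature rankings into one by Borda Count.

    Each ranking is a list of feature names, most important first.
    A feature at position p (counting from 0) in a ranking of n entries
    earns n - p points; a feature absent from a ranking earns nothing there.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings standing in for the Boruta, GBDT, and AdaBoost outputs.
boruta_rank = ["age", "chest_pain", "ldl", "bmi"]
gbdt_rank = ["chest_pain", "age", "bmi", "ldl"]
adaboost_rank = ["chest_pain", "ldl", "age", "bmi"]

fused = borda_count([boruta_rank, gbdt_rank, adaboost_rank])
print(fused)  # ['chest_pain', 'age', 'ldl', 'bmi']
```

In this toy case `chest_pain` collects 11 points, `age` 9, `ldl` 6, and `bmi` 4, so the fused sequence places `chest_pain` first. Candidate subsets could then be formed from prefixes of the fused sequence and scored with a classifier.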
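
The oversample-then-clean idea behind the hybrid sampling method can be sketched in plain Python. This is a simplified illustration under several assumptions: the clustering and statistical-testing steps are left out, `smote_like` only imitates the interpolation idea of SMOTE, both members of each Tomek link are dropped (the abstract does not state the removal policy), and all function names and data points are invented for the example. `smote_like` needs at least two minority points.

```python
import math
import random


def nearest(i, points, candidates):
    """Index of the candidate closest to points[i], excluding i itself."""
    return min((j for j in candidates if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(minority[i], minority[j]))[:k]
        j = rng.choice(neighbours)
        t = rng.random()  # position along the segment from point i to point j
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of opposite-class points that are each other's
    nearest neighbour."""
    idx = range(len(points))
    links = []
    for i in idx:
        j = nearest(i, points, idx)
        if labels[i] != labels[j] and nearest(j, points, idx) == i and i < j:
            links.append((i, j))
    return links


# Toy two-dimensional data: label 0 is the majority class, label 1 the minority.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.25, 0.35), (0.9, 0.9)]
minority = [(1.0, 1.0), (1.1, 0.9), (0.95, 1.1)]

# Step 1: oversample the minority class until both classes have equal size.
synthetic = smote_like(minority, n_new=len(majority) - len(minority))
points = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))

# Step 2: drop both members of every Tomek link found in the resampled data.
links = tomek_links(points, labels)
drop = {i for pair in links for i in pair}
cleaned = [(p, y) for i, (p, y) in enumerate(zip(points, labels)) if i not in drop]
print(len(points), "samples after oversampling;", len(cleaned), "after cleaning")
```

In practice a maintained library implementation of SMOTE and Tomek Links would normally replace these hand-written helpers; the sketch is only meant to show the order of the two steps.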
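
The weighted voting used in the second layer can be illustrated with a small sketch. It assumes soft voting over positive-class probabilities, which the abstract does not specify, and the probabilities, weights, and the name `weighted_soft_vote` are hypothetical. In the dissertation the weights are assigned by Bayesian optimization, which is not reproduced here.

```python
def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-classifier positive-class probabilities
    for a single sample."""
    if not weights or len(probabilities) != len(weights):
        raise ValueError("need one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total


# Hypothetical outputs of three first-layer classifiers for one patient record.
base_probs = [0.9, 0.6, 0.3]
# Hypothetical weights, standing in for values a weight search might return.
weights = [0.5, 0.3, 0.2]

score = weighted_soft_vote(base_probs, weights)
label = int(score >= 0.5)  # 1 = predicted positive, 0 = predicted negative
print(round(score, 2), label)
```

The third (Stacking) layer would then train a meta-classifier on the outputs of several such voting ensembles instead of averaging them with fixed weights.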
Keywords/Search Tags: coronary artery diseases, machine learning, feature selection, hybrid sampling, ensemble learning, model interpretability