Research On Diagnosis Technology Of Liver Disease Based On Decision Tree

Posted on: 2020-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: S H Yang
GTID: 2404330596996919
Subject: Computer Science and Technology
Abstract/Summary:
Liver disease is common and extremely harmful, and methods for diagnosing it have attracted wide attention from researchers. With the development of big data technology, data mining is increasingly applied to medical diagnosis, and liver disease diagnosis systems based on data mining have become an active research topic in recent years. With the goal of constructing a liver disease diagnosis model, this thesis examines three problems encountered during construction: data imbalance, the curse of dimensionality, and the choice of modeling method. First, a new algorithm, BN-SMOTE, is proposed to address a defect of the Borderline-SMOTE algorithm in handling imbalanced data. Second, to address the curse of dimensionality, a new multi-criteria fusion feature selection algorithm is proposed, because a filter feature selection algorithm with a single criterion cannot fully evaluate a feature subset. Finally, a liver disease diagnosis model is built on the decision tree algorithm. The specific work is as follows:

(1) BN-SMOTE is proposed to solve the problem that the classical Borderline-SMOTE algorithm, when generating new samples, misses some important minority sample points at the decision boundary, which degrades classification accuracy. The algorithm first finds the nearest neighbors of the minority samples within the majority class, which yields a new majority sample set at the decision boundary. It then finds the nearest neighbors of this new majority set within the minority class, which yields the minority sample set at the decision boundary. This recovers the minority sample points in the boundary region that Borderline-SMOTE misses. In experiments on three public datasets, the G-mean and F-value of BN-SMOTE under the C4.5 decision tree are 3.84% and 2.26% higher, respectively, than those of Borderline-SMOTE. BN-SMOTE also compares favorably with the recent RBO and CN-SMOTE algorithms on imbalanced data.

(2) Traditional filter feature selection algorithms rely on a single evaluation criterion, cannot comprehensively evaluate feature subsets, and therefore reduce the accuracy of the classification model. To address this defect, a new feature selection algorithm, MFMSC, is proposed. The algorithm combines three evaluation methods (mutual information, the chi-square test, and Relief-F) to select the optimal feature subset. It starts from the premise that the more diverse the feature subsets used for fusion, the better the fusion result. The three feature weight vectors are then combined by weighted fusion into a new feature weight, which determines the optimal feature subset. Experiments on four public datasets show that, under the C4.5 decision tree classifier, the accuracy of MFMSC is 2.66% higher than that of the mutual information method, 1.78% higher than that of the chi-square test method, and 1.24% higher than that of the Relief-F method. MFMSC also achieves higher classification accuracy than the recent UFSACO and FSCBAS algorithms in experiments on datasets with different characteristics.

(3) To reduce the probability that the liver disease diagnosis model misdiagnoses critically ill patients as mild cases, a decision tree algorithm with a decision risk matrix, DRM-C4.5, is proposed on the basis of C4.5. The algorithm introduces the concepts of misjudgment cost and a decision risk matrix, and adds the misclassification cost to the gain ratio calculation as a new basis for attribute splitting. Its innovation is that the cost of misjudgment is considered when selecting the optimal splitting attribute, which reduces the probability of misdiagnosing severe patients. Experiments were carried out on real data provided by the Fifth People's Hospital of Wuxi. Compared with the ID3 and C4.5 decision tree algorithms and the recent Boosted C5.0 and MLP algorithms, DRM-C4.5 retains an advantage in classification accuracy, and the error rate on high-cost samples falls to nearly 0. This indicates that the diagnosis model based on the DRM-C4.5 decision tree significantly reduces the probability of misjudging critically ill patients and meets the special requirements of a liver disease diagnosis model.
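The two-step neighbor search described for BN-SMOTE in item (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names, the Euclidean distance, the neighbor count `k`, and the SMOTE-style interpolation step are all choices made here, since the abstract does not specify them.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def boundary_minority(minority, majority, k=1):
    """Two-step neighbour search; returns sorted indices into `minority`."""
    # Step 1: majority points that are a nearest neighbour of some minority point.
    border_major = set()
    for p in minority:
        border_major.update(k_nearest(p, majority, k))
    # Step 2: minority points that are a nearest neighbour of those majority points.
    border_minor = set()
    for j in border_major:
        border_minor.update(k_nearest(majority[j], minority, k))
    return sorted(border_minor)


def oversample(minority, border_idx, n_new, rng, k=2):
    """SMOTE-style interpolation from border points (assumed generation step).

    Assumes at least two minority points so every point has a neighbour.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(border_idx)
        # Nearest minority neighbours of point i, excluding i itself.
        neighbours = [j for j in k_nearest(minority[i], minority, k + 1) if j != i][:k]
        j = rng.choice(neighbours)
        t = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + t * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


# Toy data (made up for illustration): only (1, 0) faces the majority class.
minority = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
majority = [(2.0, 0.0), (3.0, 0.0), (2.0, 1.0)]
border = boundary_minority(minority, majority, k=1)  # [1]
new_points = oversample(minority, border, n_new=4, rng=random.Random(0))
```

In this toy case the search keeps only the minority point closest to the majority class, so synthetic samples are generated around the boundary rather than around every minority point.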
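The weighted fusion of feature weight vectors described for MFMSC in item (2) might look like the minimal sketch below. The abstract does not say how the per-criterion scores are normalized or how the fusion weights are derived from subset diversity, so min-max normalization and caller-supplied `criterion_weights` are assumptions here, and the score values are invented for illustration.

```python
def min_max(scores):
    """Rescale one criterion's scores to [0, 1] so criteria are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(score_vectors, criterion_weights):
    """Weighted sum of normalized score vectors, one fused weight per feature."""
    normed = [min_max(v) for v in score_vectors]
    n_features = len(normed[0])
    return [sum(w * v[i] for w, v in zip(criterion_weights, normed))
            for i in range(n_features)]


def top_k(fused, k):
    """Indices of the k features with the largest fused weight."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]


# Hypothetical scores for four features under the three criteria.
mutual_info = [0.1, 0.4, 0.3, 0.0]
chi_square = [2.0, 8.0, 10.0, 0.0]
relief_f = [0.05, 0.2, 0.1, 0.0]

# Equal weights are a placeholder for the thesis's diversity-based weighting.
fused = fuse([mutual_info, chi_square, relief_f], [1 / 3, 1 / 3, 1 / 3])
selected = top_k(fused, 2)  # [1, 2]
```

Feature 2 ranks first under the chi-square test alone, but feature 1 ranks first after fusion because two of the three criteria prefer it, which is the kind of disagreement a single-criterion filter cannot resolve.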
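The abstract for item (3) says that a misclassification cost enters the gain ratio calculation but does not give the formula. The sketch below shows one hypothetical way such a criterion could be built: the standard C4.5 gain ratio is divided by one plus the average cost a split would leave behind if each branch predicted its cheapest class. The `risk` matrix values, the combining rule, and all function names are assumptions, not the DRM-C4.5 definition.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def branches(rows, labels, attr):
    """Group class labels by the value each row takes on `attr`."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    return list(groups.values())


def gain_ratio(rows, labels, attr):
    """Standard C4.5 gain ratio: information gain / split information."""
    groups = branches(rows, labels, attr)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups)
    return (entropy(labels) - remainder) / split_info if split_info > 0 else 0.0


def split_cost(rows, labels, attr, risk):
    """Average misjudgment cost if each branch predicts its cheapest class."""
    classes = {pred for _, pred in risk}
    total = 0.0
    for g in branches(rows, labels, attr):
        counts = Counter(g)
        total += min(sum(risk[(true, pred)] * c for true, c in counts.items())
                     for pred in classes)
    return total / len(labels)


def cost_adjusted_gain_ratio(rows, labels, attr, risk):
    """Hypothetical combination: penalize splits that leave costly errors."""
    return gain_ratio(rows, labels, attr) / (1.0 + split_cost(rows, labels, attr, risk))


# Assumed risk matrix keyed by (true class, predicted class):
# calling a severe patient mild costs ten times the opposite error.
risk = {("mild", "mild"): 0, ("severe", "severe"): 0,
        ("severe", "mild"): 10, ("mild", "severe"): 1}

labels = ["mild"] * 4 + ["severe"] * 4
rows = [{"A": "x", "B": "lo"}, {"A": "x", "B": "lo"}, {"A": "x", "B": "lo"},
        {"A": "y", "B": "lo"}, {"A": "x", "B": "hi"}, {"A": "y", "B": "hi"},
        {"A": "y", "B": "hi"}, {"A": "y", "B": "hi"}]

score_a = cost_adjusted_gain_ratio(rows, labels, "A", risk)
score_b = cost_adjusted_gain_ratio(rows, labels, "B", risk)
```

On this invented data, attribute `B` separates the classes perfectly and keeps its full gain ratio, while attribute `A` leaves mixed branches and is penalized. Under the asymmetric risk matrix, the cheapest prediction in a mixed branch is "severe" even where mild cases are the majority, which reflects the stated goal of avoiding severe-as-mild errors.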
Keywords/Search Tags: liver disease diagnosis, unbalanced data learning, oversampling algorithm, feature selection, decision tree