
Comparative Study Of Classification Method In Traditional Chinese Medicine Differentiation

Posted on: 2009-09-26 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: S H Chen
GTID: 1114360245450011 | Subject: Chinese medical science
Abstract/Summary:
Background: In Traditional Chinese Medicine (TCM) research, syndrome differentiation is the core task and the precondition for effective treatment. To study the classification rules of TCM, epidemiological methods, multivariate statistics, machine learning, neural networks and many other methods have been introduced, and a wide range of approaches now compete. However, different methods produce different classifiers, and the quality of a classifier directly affects the efficiency and accuracy of data mining. At present, most research applying data analysis or data mining methods to TCM differentiation is confined to the single method it uses; a comprehensive side-by-side comparison of the typical methods has not yet been made. Furthermore, model evaluation methods are used inconsistently and without a common standard, so a partial view is difficult to avoid. Correctly evaluating the value of each classification method in TCM differentiation research, together with its strengths and weaknesses, in order to guide the choice of method, is a prerequisite for the reasonable use of these methods in multidisciplinary research on TCM modernization, and it has broad prospects for future research.

The differentiation rules of primary insomnia are one focus of current clinical research, and the application of methods there is in the same situation. This study takes primary insomnia as its object of investigation and collects the relevant clinical data. On this data platform, we first carry out attribute reduction based on statistical processing and on the rough set method, respectively.
We then apply typical classification methods from statistics, machine learning and neural networks (logistic regression, the Bayesian classifier, rule-based classification, the C4.5 decision tree, the BP neural network, the RBF neural network, the probabilistic neural network and the support vector machine) to classify the clinical TCM data on primary insomnia. We compare these methods side by side and assess the value of each for TCM syndrome classification. In this way we discuss the data reduction, classification and model evaluation methods that suit the characteristics of TCM data.

Objective:
1 To establish classification models for the syndrome Pathogenic fire derived from stagnation of liver-QI in primary insomnia using the support vector machine and the probabilistic neural network, to assess their value for TCM syndrome classification, and to compare them with several other commonly used classification methods in order to evaluate the characteristics of each.
2 To compare three attribute reduction methods (based on correlation analysis, principal component analysis and rough sets, respectively) and assess their value for data processing in TCM syndrome research.

Method: This study is a cross-sectional survey. Based on relevant domestic and foreign research reports and on TCM theory about primary insomnia, we designed the "Insomnia clinical observation questionnaire", which includes Western medicine scales and a Chinese medicine syndrome questionnaire, and used it to survey primary insomnia outpatients at Guangdong Province Hospital of TCM. A database was built from the questionnaire content with Epidata 4.1a.
After data processing (filling missing values, discretization and normalization), three methods were applied separately for attribute reduction (dimension reduction): bivariate correlation analysis in SPSS 13.0 (using the Spearman correlation coefficient and filtering out attributes with a P value above 0.05), principal component analysis in SPSS 13.0 (extracting attributes whose eigenvalues were above 1 and whose communality was above 0.4), and rough sets (ROSETTA software).

The database was split into two parts by the improved sample division method at a ratio of 5:1 (450 cases / 92 cases): cases assigned random numbers from 0 to 92 formed the test set and the remainder formed the training set. Models were then built on the three reduced training databases with the following methods: logistic regression (Forward LR and Backward LR models) in SPSS 13.0; the Bayesian classifier, rule-based classification (PARI) and the C4.5 decision tree in WEKA 3.5.7; the BP neural network, RBF neural network and probabilistic neural network in the MATLAB 7.0 neural network toolbox; and the support vector machine (polynomial, radial basis function and sigmoid kernel models) in LIBSVM 2.85.

For the training set, both the original test and five-fold cross-validation were used to evaluate the goodness of fit and the classification performance of the established models.
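
The difference between the two training-set evaluations can be sketched as follows. This is a minimal illustration, not the study's procedure: it reads the "original" test as resubstitution (scoring a model on the same cases it was fitted to), which is an assumption, and it uses a toy 1-nearest-neighbour classifier and synthetic data that do not come from the study. All function names are hypothetical.

```python
import random

def nearest_label(train, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    return min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[1]

def accuracy(train, test):
    """Fraction of test rows whose label the 1-NN rule recovers from train."""
    return sum(nearest_label(train, x) == y for x, y in test) / len(test)

def five_fold_accuracy(data, k=5):
    """Mean held-out accuracy over k folds (data is assumed to be in random order)."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Fit on the other k-1 folds, score on the fold that was left out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k

# Synthetic stand-in for a 450-case training set (assumption, not study data):
# two noisy numeric features and a binary syndrome label.
rng = random.Random(0)
data = []
for _ in range(450):
    label = rng.random() < 0.4
    centre = 1.0 if label else 0.0
    data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))

resubstitution = accuracy(data, data)       # fit and score on the same cases
cross_validated = five_fold_accuracy(data)  # each case scored by a model that never saw it
print(f"resubstitution: {resubstitution:.3f}, 5-fold CV: {cross_validated:.3f}")
```

A 1-nearest-neighbour rule always recovers its own training labels, so resubstitution reports perfect accuracy here no matter how well the rule separates unseen cases. The cross-validated figure is the one that reflects held-out performance.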
The main assessment indices were sensitivity, specificity, accuracy, the rate of missed diagnosis, the rate of misdiagnosis, the Youden index, positive predictive value, negative predictive value, positive likelihood ratio, negative likelihood ratio, the consistency test (Kappa value) and the ROC curve. The models were then used to predict the classification of the test set for prospective evaluation, with accuracy, Kappa, mean absolute error and root mean squared error as indices. The three attribute reduction methods were assessed by attribute evaporation rate, computational complexity, model complexity, and the classification and prediction performance of the resulting models. With all these indices we weighed the pros and cons of the three reduction methods and of the two-category classification models.

Result: 414 patients with primary insomnia were enrolled; 128 of them completed two observations and 286 completed one. Counting each observation, 542 syndrome records were collected, with overlapping syndromes. The most frequent syndrome was Pathogenic fire derived from stagnation of liver-QI, with 183 cases, and it was used as the example for building the classifiers.
1 There were 95 original variables (including PSQI, symptoms and signs, but excluding light red tongue and thin whitish fur). Reduction by bivariate correlation analysis gave a subset of 55 attributes, principal component reduction gave a subset of 33 attributes, and rough set reduction gave the smallest subset, only 19 attributes, with the highest attribute evaporation rate (65.455%). The models built on the rough set subset performed better than the principal component reduction models, and better than or similar to the correlation analysis reduction models.
2 For every kind of model, the accuracy of the original test was better than that of cross-validation; in some models the difference reached nearly 20%.
However, when the models with high original test accuracy were put to further use, their results turned out to be markedly lower.
3 Logistic regression model: The Backward LR model was superior or similar to the Forward LR model on all indicators. For both Forward and Backward models, the differences in area under the ROC curve (AUC) in 5-fold cross-validation among the models built on the three reduction methods were not statistically significant. The average correct classification rate was about 86.222%, the average AUC in 5-fold cross-validation was 0.904 (differences not statistically significant), and the average prediction accuracy was 89.855%.
4 Bayesian classifier: The accuracy of the Bayesian classifiers built on the three reduction results ranged from 79.111% to 87.556%, averaging 84.148%. The average AUC in 5-fold cross-validation was 0.895, and there was a significant difference between the models built on the rough set or correlation reduction results and the model built on the principal component reduction result. The prediction accuracy ranged from 83.696% to 92.391%.
5 Rule-based classifier: The models built on the three reductions contained 5, 4 and 5 rules, respectively. The coverage of the rules on the training set was relatively low in all cases, and there was a large gap between the accuracy of the original test and that of 5-fold cross-validation. The accuracy of the three models fluctuated between 77.778% and 87.556%, averaging 83.037%. The average AUC in 5-fold cross-validation was above 0.829, and the prediction accuracy ranged from 81.304% to 89.348%, averaging 85.507%.
6 C4.5 decision tree: The C4.5 decision trees built on the three reduction results had 15, 12 and 10 nodes, respectively. Training was fast. However, the three models covered only attributes whose presence led to a positive result, so their general classification capability was mediocre. The accuracy was about 85%.
The average area under the ROC curve in 5-fold cross-validation was approximately 0.834, and that of the rough set reduction model was significantly larger than those of the other two models. The prediction accuracy ranged from 83.696% to 89.130%, averaging 86.957%.
7 SVM: Among the three kernel models, the radial basis function (RBF) kernel model gave the best classification, surpassing the other two kernel models on all indicators. There was a significant difference in 5-fold cross-validation AUC between the sigmoid kernel model and the RBF kernel model, and the RBF kernel model used fewer support vectors. After the parameters were optimized, the correct rate increased significantly. The classification accuracy of the model built on the correlation analysis reduction result reached 100%; those of the other two models were about 88.222% and 92.222%. The AUC in 5-fold cross-validation was above 0.94, and that of the rough set reduction model was significantly better than that of the principal component reduction model. The prediction accuracy was above 92%.
8 BP network: Three BP networks with 4, 3 and 5 hidden nodes, respectively, were built on the three reduction results. Parameter setting was time-consuming, and the classification and prediction accuracy were volatile, with high prediction error. The classification accuracy ranged from 81.778% to 89.111%, averaging 85.185%. The average AUC was 0.889, and that of the correlation reduction model was significantly better than those of the other two reduction models. The prediction accuracy fluctuated markedly between 73.913% and 95.652%, averaging 86.594%.
9 RBF neural network: An RBF network with 3 hidden nodes was built on each of the three reduction subsets. Learning was faster than for the BP network, and parameter setting was simpler.
The average correct classification rate was 88.741%. The AUC in 5-fold cross-validation was above 0.89, and all pairwise comparisons among the three reduction models showed significant differences. The average prediction accuracy was about 90.217%.
10 PNN neural network: The models had fewer parameters and ran faster. The classification accuracy in 5-fold cross-validation was above 86% in all cases, reaching approximately 95%, and averaged 91.111%. The AUC in 5-fold cross-validation was above 0.93, averaging 0.967, and that of the principal component reduction model was significantly lower than those of the other two reduction models. The prediction accuracy was above 90% in all cases, averaging 93.840%.
11 According to the 5-fold cross-validation AUC and the hypothesis test results, the eight models were graded by classification performance:
Correlation reduction models: SVM > PNN > Logistic, RBF > PARI, BP, C4.5. The Bayesian classifier showed no significant difference from any model in the latter two grades, so it falls between grades 3 and 4.
Principal component reduction models: SVM, PNN > RBF, Bayes > C4.5, PARI. Logistic and BP showed no significant difference from RBF, Bayes or C4.5, so they fall between grades 2 and 3.
Rough set reduction models: PNN > SVM > Bayes, Logistic, BP, C4.5 > PARI. RBF showed no significant difference from PNN or SVM, so it falls between grades 1 and 2.

Conclusion:
1 Models built on attribute reduction by rough sets maintain a high classification capability. The reduction eliminates unnecessary knowledge from the information system (decision table) as far as possible and yields a small subset with good classification ability.
It is therefore a worthwhile reduction method for TCM syndrome data processing.
2 The original test may overestimate the performance of a classifier, so it has limited practical value and is not suitable for the objective evaluation of models. The results of 5-fold cross-validation are more stable and reflect the true classification capacity of the models, especially in the presence of noisy data, and they avoid large fluctuations in the classification results. We recommend that future studies use cross-validation to evaluate classifiers as objectively as possible.
3 Compared with traditional evaluation indices, the ROC curve offers high reliability and an accurate, objective description, and in particular it is less affected by bad data. It also allows a hypothesis test on the AUC of two diagnostic tests, so its results are more intuitive and objective.
4 Overall, the eight models applied in this study all have some diagnostic value. SVM, PNN and RBF are the best, followed by logistic regression and the Bayesian classifier; BP, C4.5 and PARI are average.
5 The logistic regression model has a well-developed system for evaluation and revision, and it clearly shows the magnitude and direction of each attribute's contribution to the model. However, it is easily affected by collinearity and by strongly influential points, and its prediction accuracy and error rank in the middle of the eight models. The Backward LR model is superior to the Forward LR model. Moreover, in variable selection the Backward LR model favours variables with strong joint action, so it is recommended for TCM syndrome data, which are generally correlated.
6 The Bayesian classifier is vulnerable to the influence of frequencies and prior probabilities.
Its performance is similar to that of the logistic regression model.
7 The rule-based classifier generates easy-to-understand rules and shows the strength of each rule. However, its classification and prediction capabilities are poor and unstable. The model is therefore suitable for extracting rules that help in understanding the connotation of TCM syndromes, but unsuitable for classification and prediction research.
8 The C4.5 decision tree generates a visual tree diagram that gives an intuitive understanding of the contribution of attributes to syndrome discrimination, and it is robust to strongly influential points. However, its sensitivity, rate of misdiagnosis, negative predictive value and negative likelihood ratio are relatively low, while its rate of missed diagnosis, specificity, positive predictive value and positive likelihood ratio are relatively high. Its classification capability is mediocre, with high prediction error. We suggest that the model is suitable for building a decision tree that aids intuitive understanding of the connotation of TCM syndromes, but not for classification and prediction research.
9 The radial basis function kernel model of the support vector machine is well suited to data analysis in TCM syndrome research: it is superior to the polynomial and sigmoid kernel models in classification and prediction accuracy, uses fewer support vectors, and generalizes well. An RBF kernel is therefore a good choice for a TCM syndrome classification study. SVM constructs an optimal hyperplane for TCM syndrome data, which helps to obtain a relatively accurate boundary in the feature space for nonlinearly separable TCM syndrome data. Its classification capability is better than that of the other classifiers, with better robustness and generalization. For these reasons, SVM is feasible and effective in TCM syndrome research.
10 The BP network learns slowly for TCM syndrome diagnosis.
Its generalization ability is poor, and it is prone to falling into local minima. The feature vectors of TCM syndromes are difficult to obtain, and the syndrome diagnostic accuracy is not high enough. Its actual performance is therefore relatively poor, and it is difficult to promote.
11 The RBF neural network learns faster than the BP neural network and its parameters are simpler to set. It classifies and predicts TCM syndrome data well, with better robustness, and is applicable to TCM syndrome research.
12 The PNN neural network has fewer parameters and runs faster. It is quite robust. Its classification and prediction accuracy are fairly high, second only to SVM. It generalizes well and recognizes the classification information in TCM syndrome data well, giving good results in syndrome classification and prediction. It is therefore worth promoting in TCM syndrome classification research.
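
For readers who want to reproduce the assessment indices named in the Method section, the sketch below computes them from a 2x2 table of predicted against reference labels, and computes AUC from classifier scores. It is an illustration under stated assumptions, not the software used in the study: the rates of missed diagnosis and misdiagnosis are taken to be the false negative and false positive rates, Kappa is taken to be Cohen's kappa, the function names are hypothetical, and the counts and scores are invented for the example.

```python
def diagnostic_indices(tp, fp, fn, tn):
    """Indices from a 2x2 table: true/false positives and negatives as counts."""
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals (Cohen's kappa).
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": acc,
        "missed_diagnosis_rate": 1 - sens,  # assumed to mean the false negative rate
        "misdiagnosis_rate": 1 - spec,      # assumed to mean the false positive rate
        "youden": sens + spec - 1,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sens / (1 - spec),
        "lr_negative": (1 - sens) / spec,
        "kappa": (acc - p_chance) / (1 - p_chance),
    }

def auc(scores_pos, scores_neg):
    """AUC as the chance that a positive case outscores a negative one; ties count half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in scores_pos for q in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Invented counts and scores, for illustration only.
idx = diagnostic_indices(tp=40, fp=10, fn=10, tn=40)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
for name, value in idx.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {area:.3f}")
```

The likelihood ratios are undefined when specificity is exactly 0 or 1, and this sketch raises `ZeroDivisionError` in that case rather than guessing a value. The pairwise AUC is quadratic in the number of cases, which is acceptable for a test set of the size described here but would need a rank-based formula for large data.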
Keywords/Search Tags: Classification algorithm, Attribute reduction, Support Vector Machine, Probabilistic neural network, Primary insomnia, Pathogenic fire derived from stagnation of liver-QI