| With the development of traditional Chinese medicine (TCM) informatization, objective research on TCM diagnosis has attracted increasing attention both at home and abroad. How to make full use of valuable TCM clinical data to provide scientific decision support for TCM diagnosis and treatment, and thereby promote the further development of Chinese medicine, has become a research focus. Data mining offers a direction for solving these problems, and classification, one of its main research topics, is receiving increasing attention in TCM clinical diagnosis. Feature selection can improve classification performance and also provides a new way to discover relationships between diseases and TCM characteristics. Based on the actual characteristics of collected TCM clinical data, this paper studies the classification of clinical data from three key aspects: classification of imbalanced data, multi-label classification, and feature selection for classification. The aim is to strengthen computer-aided diagnosis by improving classification performance. The main work is as follows. First, disease classification on imbalanced data. At the data level, the undersampling method is improved according to the actual characteristics of TCM clinical data. Combining the improved sampling method with Asymmetric Bagging, an improved algorithm, FPUSAB, is proposed. Experimental results show that, compared with Asymmetric Bagging, FPUSAB achieves an average increase of 10.5% in AUC and 8.4% in Bacc. Second, disease classification on multi-label data. Considering the imbalance of existing TCM clinical data and the shortcomings of ML-kNN, an improved algorithm, WML-GkNN, is proposed by introducing granular computing into WML-kNN. Experimental results show that WML-GkNN achieves average improvements of 11.2% in Hamming Loss, 5.3% in Average Precision, 2.1% in Coverage, and 5.1% in One-Error, and an average improvement of 7.6% in Hamming Loss compared with the improved algorithm. Third, the impact of feature selection on classification. TCM clinical data contain many features, which is not conducive to computer-aided diagnosis. In this paper, the PRFS-FPUSAB algorithm is proposed on the basis of FPUSAB. Experimental results show that AUC improves by 7.4% on average. For multi-label disease classification, the HOML algorithm, which shows good selection performance on coronary heart disease, is used to select features from the multi-label data. The results show average improvements of 17.77% in Hamming Loss, 6.28% in Average Precision, 15.73% in Coverage, 10.21% in One-Error, and 25.22% in Ranking Loss. The selected features are consistent with the relevant TCM theory on the associated diseases. |
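
Two of the evaluation measures named in the abstract, Bacc (balanced accuracy) and Hamming Loss, can be illustrated with a minimal sketch. The definitions below are the standard textbook ones and are assumed here, since the abstract does not state the exact formulas used; the function names and toy labels are hypothetical and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive).

    Assumed standard definition; unlike plain accuracy it is not
    dominated by the majority class on imbalanced data.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


def hamming_loss(Y_true, Y_pred):
    """Fraction of (sample, label) pairs that are predicted incorrectly.

    Assumed standard multi-label definition; lower values are better.
    """
    total = sum(len(row) for row in Y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(Y_true, Y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


# Toy imbalanced example: 2 positives, 8 negatives, one positive missed.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(balanced_accuracy(y_true, y_pred))  # 0.75, while plain accuracy is 0.9

# Toy multi-label example: 2 samples, 3 labels each, 2 wrong label slots.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 1]]
print(hamming_loss(Y_true, Y_pred))  # 2 of 6 label slots are wrong
```

Because Hamming Loss is an error rate, an "improvement" in it corresponds to a lower value, whereas an improvement in AUC or Bacc corresponds to a higher one.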