| Traditional Chinese medicine (TCM) embodies the philosophy and health wisdom that the Chinese nation has developed over thousands of years. Long-term clinical practice has accumulated rich and valuable resources. These resources are varied, large in volume, and widely distributed across the field of TCM, so how to fully integrate, utilize, and manage the data is an open problem for the field. The TCM prescription is an important part of TCM theory, method, prescription and medicine. It is formed by drug selection and compatibility on the basis of syndrome differentiation and treatment. Based on large-scale clinical data, effective core prescriptions and potential drug compatibilities for disease treatment can be discovered to support clinical decision-making. At present, however, TCM data are still stored and processed with traditional methods, which scale poorly and easily reach bottlenecks. To solve this problem, this paper combines big data technology, machine learning, complex network analysis, and other algorithms to conduct distributed mining of massive clinical data. The paper mainly includes the following contents: (1) Based on the CDH (Cloudera’s Distribution Including Apache Hadoop) big data platform, a data warehouse of TCM big data resources was constructed. First, an architecture combining top-down and bottom-up design is proposed to make the logical structure of the data warehouse clearer. The multi-source data are collected into HDFS, the characteristics of the data and the relationships among them are analyzed, and the subject domain model and multi-dimensional data model are designed. Then, ETL tasks were developed using Spark, HiveQL, and other technologies, and the ETL workflow was configured through DolphinScheduler to map the multi-source data into the data warehouse, which currently contains nearly 340 million records and about 351 GB of data. Finally, Kylin was used to construct a data cube on the prescription subject, and a demonstration of multi-dimensional OLAP analysis was carried out. The data warehouse supports multi-source data integration and data processing, as well as Web-based multi-dimensional analysis and data mining. (2) Based on the data warehouse of TCM big data resources, distributed mining of clinically effective TCM prescriptions was completed. First, clinical diagnosis and treatment data of COPD patients were extracted from the data warehouse to form a data mart. Then, patients were divided into an effective group and an ineffective group according to treatment outcome, and propensity score matching was used to eliminate confounding bias between the two groups. Prescription information was extracted from the effective group to construct a drug compatibility network, and a multi-scale backbone network algorithm was used to extract the core drug subnetwork. Using the effective prescription drug concentration analysis method (P<0.05), 165 effective prescriptions were found, with an effective ratio of 80.88%, which could serve as core prescriptions for the treatment of COPD. Finally, knowledge linking effective drugs and diseases was extracted with the conditional mutual information method. (3) Distributed mining of the compatibility rules of TCM prescriptions was carried out. To mine association rules in TCM prescriptions efficiently, this paper proposes a distributed CHARM algorithm. Built on the Spark framework, the algorithm addresses the low efficiency and memory overflow of traditional methods. Because the number of mined association rules is large, this paper also proposes a distributed compression algorithm that yields fewer and more representative association rules. Experiments show that the resulting association rules provide useful guidance for clinical practice. |
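
The CHARM algorithm named in part (3) mines closed frequent itemsets, that is, frequent itemsets for which no proper superset has the same support. The following is a minimal single-machine sketch of that concept only. It uses brute-force enumeration rather than CHARM's search strategy, it is not distributed, and it is not the paper's implementation. The herb names, the toy prescriptions, and the function names are all hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of closed frequent itemsets.

    An itemset is closed if no proper superset has the same support.
    This is exponential in the number of items and is only meant to
    illustrate the definition on tiny inputs.
    """
    items = sorted(set().union(*transactions))

    # Step 1: collect every frequent itemset with its support count.
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count

    # Step 2: keep only itemsets with no equal-support proper superset.
    # Any such superset is itself frequent, so checking `frequent` suffices.
    closed = {}
    for itemset, count in frequent.items():
        has_equal_superset = any(
            itemset < other and count == other_count
            for other, other_count in frequent.items()
        )
        if not has_equal_superset:
            closed[itemset] = count
    return closed


# Hypothetical toy prescriptions; each is a set of herb identifiers.
prescriptions = [
    frozenset({"herb_a", "herb_b", "herb_c"}),
    frozenset({"herb_a", "herb_b"}),
    frozenset({"herb_a", "herb_b", "herb_d"}),
    frozenset({"herb_c", "herb_d"}),
]

result = closed_frequent_itemsets(prescriptions, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

In this toy data, `herb_a` and `herb_b` always occur together, so the single-herb itemsets are absorbed into the pair. This is the sense in which closed itemsets give a smaller set of patterns from which association rules can be derived.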