The exploitation of electronic health records (EHRs) has substantial socioeconomic value, but the sensitivity and privacy of such data place data mining and privacy protection in direct tension. How to resolve this conflict is a concern for both academia and industry. A typical data-mining development process includes data pre-processing, feature engineering, model training, model debugging, and model evaluation. Current work focuses on data privacy protection during model training, i.e., it assumes that the model structure and initial configuration are already fixed and the model needs no debugging. In practice, however, other parts of the development process require human participation in data exploration and modeling through direct access to the raw data, the so-called "human-in-the-loop". The new technical challenge is therefore to balance privacy protection against data modeling. This dissertation proposes Secure Split Learning, a trustworthy and supervised data-analysis architecture that separates model processing into a debugging environment and a runtime environment. The debugging environment provides synthetic data on which data analysts build machine learning pipelines. The runtime environment provides a data sandbox for model training, which trains the model on the original data and scrutinizes all outputs before they are taken out. The dissertation investigates the key technologies of Secure Split Learning, including data mining, data publishing, and data evaluation under privacy protection, and uses EHR data mining as a running example to quantitatively compare state-of-the-art (SOTA) performance on raw data without privacy protection against performance under privacy-preserving conditions. The results show that the Secure Split Learning architecture and its key techniques achieve results consistent with the unprotected baseline. The specific contributions are as follows.

Existing privacy-preserving methods focus only on the model-training phase and ignore the other stages, which degrades the utility of machine-learning modeling. This chapter proposes a semi-automatic model-mining technique based on fine-grained statistical analysis to establish the best attainable data-mining performance without privacy protection. Its results are then cross-validated against the best automated machine-learning predictions, providing a baseline for improving the efficiency of model mining under privacy protection. The experiments show the importance of privacy-preserving machine-learning modeling that keeps the human in the loop, achieving a 9.2% prediction improvement over the fully automated approach.

Modeling the probability distribution of rows in structured EHRs and generating realistic synthetic data with a generative adversarial network (GAN) is a non-trivial task: tabular datasets usually contain discrete columns, and traditional encoding approaches can suffer from the curse of feature dimensionality. This chapter proposes a representation-learning-based GAN that synthesizes structured EHRs using hyperbolic space, thereby creating synthetic data that approximates the original data distribution while mitigating the leakage of sensitive information. The results of this chapter show that the generated training data differ from the original data by only 2.0% in utility while ensuring privacy.

To fully demonstrate the practical application of data mining based on Secure Split Learning, this chapter performs a medical analysis of real-world EHRs on a trusted privacy-computing platform. It proposes a feature-selection algorithm based on data augmentation under privacy-preserving conditions, which effectively addresses the small data volumes and excessive feature counts of real EHRs, and ensures that model-mining results, rather than the original data, are shared. The results of this chapter show that personal health data can remain secure and confidential while the mining capability of medical data improves significantly (a 24.9% improvement in AUC).

Privacy risk assessment for data publishing is of great importance: desensitizing original EHRs reduces their utility and thus fails to balance prediction accuracy against privacy protection. This chapter proposes a privacy-assessment scheme for sampled data and, building on its results, a debugging technique for embedded feature selection based on a clustering algorithm, as a way to improve classification after data sampling. The results of this chapter show prediction accuracy 8.7% higher than on the original EHRs while preventing the leakage of sensitive information, achieving the goal of balancing privacy and usability.
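As an illustrative sketch of the dimensionality problem for discrete tabular columns mentioned above, the comparison below contrasts the width of a one-hot encoding with a fixed-size learned embedding per column. The column names, cardinalities, and embedding width are hypothetical, and this is not the dissertation's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cardinalities of three discrete EHR columns.
cardinalities = {"diagnosis_code": 12000, "procedure_code": 8000, "drug_code": 3000}
embed_dim = 16  # assumed fixed embedding width per column

# One-hot: one input feature per category value across all discrete columns.
onehot_width = sum(cardinalities.values())

# Embedding: each column maps its category id to a dense 16-d vector,
# so the row representation grows with the number of columns, not the
# number of category values.
tables = {c: rng.normal(size=(n, embed_dim)) for c, n in cardinalities.items()}
row = {"diagnosis_code": 4711, "procedure_code": 52, "drug_code": 2999}
dense = np.concatenate([tables[c][row[c]] for c in cardinalities])

print(onehot_width, dense.shape)  # 23000 input features vs. (48,)
```

The generator and discriminator then operate on the compact dense representation instead of the sparse one-hot vector, which is the general motivation for representation-learning-based tabular GANs.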
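One common instance of the kind of publishing-time privacy check described above is a distance-to-closest-record (DCR) screen: synthetic rows that (near-)duplicate a real record indicate memorization risk. The sketch below uses hypothetical data and an assumed policy threshold, and is not the dissertation's assessment scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "original" records (200 rows, 8 numeric features).
original = rng.normal(size=(200, 8))

# Hypothetical "synthetic" release: 5 near-copies of real rows (leaky)
# plus 45 genuinely sampled rows.
synthetic = np.vstack([original[:5] + 1e-6, rng.normal(size=(45, 8))])

# Pairwise Euclidean distances, then the minimum over original rows
# gives each synthetic row's distance to its closest real record.
d = np.linalg.norm(synthetic[:, None, :] - original[None, :, :], axis=-1)
dcr = d.min(axis=1)

threshold = 0.1  # assumed policy threshold
leaky = int((dcr < threshold).sum())
print(leaky)  # → 5: exactly the near-copies are flagged
```

Rows flagged by such a screen would be rejected by the runtime environment before any output leaves the data sandbox.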