
Research On Efficient Data Privacy Protection Methods For Machine Learning

Posted on: 2024-10-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Q Liu
Full Text: PDF
GTID: 1528307373969809
Subject: Information and Communication Engineering
Abstract
In recent years, driven by big data and high computing power, machine learning has been widely applied across many fields. According to how models are trained, machine learning scenarios can be divided into centralized machine learning and distributed machine learning, the latter represented by federated learning. However, a growing body of research shows that both scenarios suffer from data privacy leakage. Specifically:

1) In centralized machine learning, users' raw data is uploaded to cloud servers for model inference or training. The uploaded raw data often contains a large amount of private information; when face data is involved in particular, the risk of privacy leakage is high.

2) In federated learning, to accelerate the convergence of model training (and thereby improve communication efficiency) and to improve model accuracy, some federated learning algorithms adopt a global data sharing strategy that risks leaking the privacy of participating devices' raw data.
3) In federated learning, although participating devices' raw data is never sent to the central server, the model updates shared during collaborative training can still leak the privacy of that raw data.

Although many studies have addressed data privacy protection, existing approaches suffer from high computational overhead and low communication efficiency. This dissertation therefore proposes efficient data privacy protection methods for centralized machine learning and federated learning scenarios. The main research contents and contributions are summarized as follows:

(1) For centralized machine learning, this dissertation proposes an efficient face de-identification framework based on embedded autoencoders to address the privacy protection issue of uploaded face data, effectively protecting face identity. The framework consists of three parts: a privacy removal network, a feature selection network, and a privacy evaluation network. To reduce the computational cost of de-identification, the privacy removal network employs two different autoencoders, one embedded within the other. In addition, an adversarial training scheme enables the privacy removal network to remove identity-related information from face images while retaining the desired facial attributes. Experimental results show that, compared with existing methods, the proposed framework saves up to 99.30% of computational overhead and improves data utility by up to 26.22%.

(2) For federated learning with non-independent and identically distributed (non-IID) data, to address the privacy protection issue of global data sharing, this dissertation proposes an efficient privacy-preserving federated learning algorithm based on selective data collection, which realizes privacy-preserving global data sharing. To prevent global data sharing from leaking the privacy of participating devices' raw data, the central server employs recently popular generative models (a stable diffusion model and a large language model) to generate candidate data for training a generative adversarial network, avoiding the collection of devices' raw data. To ensure the candidate training data lies in a domain similar to the devices' raw data, a selective data collection algorithm picks representative participating devices and asks them to share specific local class prototypes with the central server; the collected prototypes are then used to select qualified training samples from the candidate data. In addition, to improve convergence speed and test accuracy, a privacy-preserving dual-calibration approach on the device side reduces the deviation between devices' local models and the global model. Experimental results show that, while protecting the raw data privacy of participating devices, the proposed algorithm achieves the same test accuracy as existing methods while reducing communication overhead by up to 52.49%.

(3) For federated learning with non-IID data, to address the privacy protection issue of model update sharing, this dissertation proposes an efficient privacy-preserving federated learning algorithm based on variational autoencoders, which prevents gradient information in model updates from leaking the privacy of devices' raw data. Specifically, a data mixing module on the device side adds noise perturbation to the raw data: during local training, the raw data is first sent to the data mixing module, and the resulting mixed data is then used for local model training, so attackers cannot use gradients to recover the raw data. In addition, to improve convergence speed and test accuracy, a privacy-preserving global dataset distillation method on the central server side produces a global dataset that compensates for the performance loss caused by the data mixing module. Experimental results show that, while protecting the raw data privacy of participating devices, the proposed algorithm achieves the same test accuracy as existing methods while reducing communication overhead by up to 87.41%.
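The embedded-autoencoder idea of contribution (1) can be sketched in a few lines. The abstract does not give the architecture details, so everything below (layer sizes, the tanh activation, and perturbing the inner code as a stand-in for trained identity removal) is an illustrative assumption; the point is only the structure of one autoencoder nested at the bottleneck of another.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random dense layer (weight, bias) as a stand-in for trained parameters."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def forward(x, layer):
    w, b = layer
    return np.tanh(x @ w + b)

# Outer autoencoder: encodes a flattened 32x32 face image to a 64-d code.
outer_enc = dense(1024, 64)
outer_dec = dense(64, 1024)

# Inner autoencoder, embedded at the outer bottleneck: it further compresses
# the 64-d code to 16 dimensions, where identity information would be removed
# (here, illustratively, by perturbing the inner code).
inner_enc = dense(64, 16)
inner_dec = dense(16, 64)

def de_identify(image):
    code = forward(image, outer_enc)                  # outer encoding
    inner = forward(code, inner_enc)                  # inner bottleneck
    inner = inner + rng.normal(0, 0.5, inner.shape)   # placeholder identity removal
    code = forward(inner, inner_dec)                  # inner decoding
    return forward(code, outer_dec)                   # outer decoding -> de-identified image

x = rng.standard_normal((1, 1024))                    # one flattened "face image"
y = de_identify(x)
print(y.shape)                                        # (1, 1024)
```

In the dissertation the removal is learned adversarially against a privacy evaluation network rather than injected as random noise; the nesting is what saves computation, since only the small inner autoencoder sits at the bottleneck.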
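The prototype-based selection in contribution (2) can also be sketched. The abstract does not define the prototypes or the selection rule, so this sketch assumes a local class prototype is the per-class mean feature vector and that the server keeps candidates within a Euclidean-distance radius of some prototype; all names and thresholds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# A local class prototype: the mean feature vector of one class on a device
# (an assumption -- the dissertation does not specify the prototype definition).
def local_prototype(features):
    return features.mean(axis=0)

# Device-side data (never sent to the server); only prototypes are shared.
device_data = {c: rng.normal(loc=c, scale=0.3, size=(50, 8)) for c in range(3)}
prototypes = {c: local_prototype(x) for c, x in device_data.items()}

# Server-side candidate pool, e.g. produced by a generative model.
candidates = rng.normal(loc=1.0, scale=2.0, size=(500, 8))

def select_qualified(candidates, prototypes, radius=5.0):
    """Keep candidates that fall within `radius` of some class prototype."""
    protos = np.stack(list(prototypes.values()))                        # (C, 8)
    d = np.linalg.norm(candidates[:, None, :] - protos[None], axis=-1)  # (N, C)
    return candidates[d.min(axis=1) <= radius]

qualified = select_qualified(candidates, prototypes)
print(len(qualified), "of", len(candidates), "candidates kept")
```

The privacy benefit is that only class prototypes (aggregate statistics) leave the devices, while the training samples themselves come from the server-side generative models.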
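The data mixing module of contribution (3) can be illustrated minimally. The dissertation builds the module around a variational autoencoder; the convex combination with Gaussian noise below is an illustrative stand-in for "adding noise perturbation to the raw data", and the mixing coefficient `lam` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def mix(batch, lam=0.7):
    """Data mixing module: a convex combination of each raw sample with
    Gaussian noise (an illustrative assumption; the actual module in the
    dissertation is built on a variational autoencoder, omitted here)."""
    noise = rng.standard_normal(batch.shape)
    return lam * batch + (1.0 - lam) * noise

raw = rng.standard_normal((4, 8))   # a local mini-batch of raw data
mixed = mix(raw)

# Local training then runs on `mixed`, so the shared gradients are functions
# of the mixed data and cannot be inverted to recover `raw` exactly.
print(mixed.shape)                  # (4, 8)
```

The accompanying global dataset distillation on the server side is what recovers the accuracy lost to this perturbation, which is why the two components are proposed together.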
Keywords/Search Tags: Centralized machine learning, Non-IID data, Federated learning, Efficient data privacy protection