In recent years, with the advancement of IoT and AI technology, machine learning has been applied in many industries, including e-commerce, image recognition, health analysis, and behavior analysis, and has achieved good results. At the heart of this success is the use of large-scale, high-quality data for model training. Word vector training currently faces serious privacy problems. Word2Vec is a word vector training algorithm whose accuracy improves as the training data becomes more abundant; however, training word vectors requires large data sets, some of which may contain private and sensitive information, and the training phase can expose both the trained model and the users' data. Providing users with usable, reasonable, and personalized services therefore often violates their privacy; user data may include faces, irises, personal identifiers, and so on. Gathering data from multiple sources has proved difficult because of legal constraints, competitive advantage, and privacy concerns. Moreover, centrally collected data may be stored permanently and used without the data owner's knowledge, which hinders the training of high-quality models. Inner product encryption and homomorphic encryption are emerging privacy-preserving frameworks that can resist attackers possessing arbitrary background knowledge and provide secure privacy protection. Addressing the above problems in word vector training, this thesis studies Word2Vec algorithms that satisfy privacy protection. The main research contents are summarized as follows:

1. A privacy-preserving word vector training model based on Hierarchical Softmax. Hierarchical Softmax uses a Huffman tree to turn an N-way classification problem into about log N binary classifications; because only the weight vectors of the nodes on a word's path in the Huffman tree are updated, training efficiency improves greatly (a plaintext sketch of this construction follows the abstract). Building on the BGN cryptosystem and an inner product encryption design, the proposed scheme evaluates the Sigmoid function more efficiently than existing schemes and greatly reduces communication overhead. Based on Hierarchical Softmax, this thesis proposes two privacy-preserving algorithms, one for each of the two Word2Vec models (CBOW and Skip-gram), and compares them with existing privacy-preserving word vector training models: they not only lower communication overhead but also hold a significant advantage in computation time, shortening model training under the premise of preserving privacy.

2. A privacy-preserving word vector training model based on the negative sampling algorithm. Hierarchical Softmax is not ideal for training words with low frequency. Under the negative sampling strategy, only a small sample of the output vectors is updated per training instance, and the optimization goal is to minimize the probability of the negative samples while maximizing the probability of the positive sample (a plaintext sketch of this update follows the abstract). Based on negative sampling, this thesis proposes two privacy-preserving algorithms, again one for each Word2Vec model; compared with current privacy-preserving word vector training algorithms, they achieve higher efficiency and smaller storage overhead. The accuracy, efficiency, and precision of the algorithms are also evaluated experimentally. The experiments show that the functionality
of the proposed algorithms is essentially the same as that of their plaintext counterparts. All experiments in this thesis are conducted on four large real-world datasets.
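To make contribution 1 concrete, the following is a minimal plaintext sketch of standard Hierarchical Softmax, assuming nothing from the thesis beyond the textbook construction: a Huffman tree over the vocabulary reduces an N-way softmax to roughly log N sigmoid-gated binary decisions, and the resulting probabilities over the whole vocabulary still sum to one. The names (`build_huffman`, `hs_probability`) are illustrative, and the BGN and inner-product-encryption layer described above is omitted.

```python
import heapq
import math
import random
from itertools import count

def build_huffman(freqs):
    """Build a Huffman tree over the vocabulary. For each word, return the
    internal-node ids on its root-to-leaf path and the binary code
    (0 = left, 1 = right) taken at each of those nodes."""
    tiebreak, internal_id = count(), count()
    heap = [(f, next(tiebreak), w) for w, f in freqs.items()]  # leaves are strings
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)
        f2, _, n2 = heapq.heappop(heap)
        # internal nodes are (id, left, right) tuples
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (next(internal_id), n1, n2)))
    paths = {}
    def walk(node, nodes, code):
        if isinstance(node, tuple):
            nid, left, right = node
            walk(left, nodes + [nid], code + [0])
            walk(right, nodes + [nid], code + [1])
        else:
            paths[node] = (nodes, code)
    walk(heap[0][2], [], [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(word, hidden, node_vecs, paths):
    """P(word | context) as a product of ~log N binary decisions: at each
    internal node, sigmoid(+/- v_node . h) selects the left or right child,
    so only the vectors on one root-to-leaf path are ever touched."""
    prob = 1.0
    for nid, bit in zip(*paths[word]):
        dot = sum(a * b for a, b in zip(node_vecs[nid], hidden))
        prob *= sigmoid(dot) if bit == 0 else sigmoid(-dot)
    return prob

if __name__ == "__main__":
    random.seed(0)
    dim, vocab = 4, {"the": 50, "cat": 10, "dog": 8, "runs": 5}
    paths = build_huffman(vocab)
    node_vecs = {nid: [random.uniform(-0.5, 0.5) for _ in range(dim)]
                 for nids, _ in paths.values() for nid in nids}
    h = [random.uniform(-0.5, 0.5) for _ in range(dim)]
    # Sanity check: each binary decision partitions probability mass, so the
    # probabilities over the whole vocabulary sum to 1.
    print(sum(hs_probability(w, h, node_vecs, paths) for w in vocab))
```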
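For contribution 2, the sketch below shows the plaintext skip-gram negative-sampling update that an encrypted scheme would protect: each step touches only the positive word's output vector and k sampled negatives, pushing sigmoid(v . h) toward 1 for the positive sample and toward 0 for the negatives. The names (`sgns_step`, `out_vecs`) are illustrative; for brevity negatives are drawn uniformly, whereas Word2Vec samples them from a smoothed unigram distribution.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(hidden, out_vecs, pos_word, vocab, k=5, lr=0.025):
    """One skip-gram negative-sampling update. Instead of a softmax over
    the full vocabulary, only the positive word and k sampled negatives
    have their output vectors updated."""
    negatives = [w for w in vocab if w != pos_word]
    samples = [(pos_word, 1)] + [(random.choice(negatives), 0) for _ in range(k)]
    grad_h = [0.0] * len(hidden)
    for word, label in samples:
        v = out_vecs[word]
        dot = sum(a * b for a, b in zip(v, hidden))
        g = (sigmoid(dot) - label) * lr  # gradient of the log-loss w.r.t. dot
        for i in range(len(hidden)):
            grad_h[i] += g * v[i]   # accumulate before v is modified
            v[i] -= g * hidden[i]   # update the sampled output vector
    for i in range(len(hidden)):
        hidden[i] -= grad_h[i]      # update the input/hidden vector once

if __name__ == "__main__":
    random.seed(0)
    dim, vocab = 4, ["the", "cat", "dog", "runs"]
    out_vecs = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    h = [random.uniform(-0.5, 0.5) for _ in range(dim)]
    for _ in range(200):
        sgns_step(h, out_vecs, "cat", vocab, k=2)
    # The positive sample's probability is driven toward 1.
    print(sigmoid(sum(a * b for a, b in zip(out_vecs["cat"], h))))
```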