| As the most common form of unstructured data, text data is the most important information carrier on the Internet. Today, the rapid development of modern information technologies such as the mobile Internet and the Internet of Things has caused the scale of text data to grow at a geometric rate. Text data usually contains a great deal of valuable information, and the effective processing and mining of large-scale text data is of great significance in many fields, such as education, medical care, e-commerce, and social networking. However, text data is more complex and abstract than other forms of data. Processing and analyzing massive text data often incurs huge storage and computing overhead, which makes these tasks difficult to deploy on resource-constrained terminal devices. In recent years, outsourced data processing services in the cloud computing environment have become increasingly mature and have been accepted by more and more users. However, when users adopt the data service outsourcing model, they also lose control of their private data. Data outsourcing service providers are not fully trustworthy, and malicious adversaries also exist, so the outsourced data faces serious security threats. Although data encryption can be used to protect the privacy of outsourced data, achieving security, versatility, and efficiency at the same time remains a major challenge. In this paper, we analyze in depth the privacy issues in the different steps of text data processing. Taking privacy protection as the primary goal, we study privacy protection schemes for three key aspects of text data processing: text representation, text classification, and text search. The main contributions of our research can be summarized as follows: (1) To protect the privacy of text data during distributed representation generation under the data service outsourcing model, we propose a privacy-preserving 
distributed text representation learning scheme. We first use secret sharing to encrypt the original text data and store the resulting shares on two cloud servers. Then we design a series of secure computing protocols based on secret sharing to carry out the distributed representation learning process. We also design a simple and efficient method for computing the non-linear activation function. On this basis, we take the two most popular neural network language models and design several privacy-preserving distributed text representation learning algorithms that run the secure computing protocols between the cloud servers. The algorithms obtain distributed representations of text efficiently and accurately. Compared with existing solutions, our scheme greatly reduces the computing and communication overhead of both the user terminal devices and the servers, which enables it to meet practical application requirements. (2) To solve the privacy-preserving data classification problem in data sharing scenarios, we propose a privacy-preserving non-linear SVM classification scheme. To enhance security, we combine secure multi-party computing and blockchain technology to achieve secure data sharing among multiple users and multiple computing servers. The scheme also protects privacy both when training and when using the classification model. First, we design a blockchain-based consensus mechanism and incentive mechanism for cooperation among multiple users and multiple computing servers. Then we use secret-sharing-based secure multi-party computing to encrypt data and protect privacy while the outsourced data is being processed. Based on the classic support vector machine classification model, we propose a privacy-preserving classification model training scheme that supports result verification. We also design a non-linear kernel function calculation method so that our scheme can satisfy the needs of 
non-linear data classification. Compared with existing privacy-preserving classification schemes, our scheme achieves higher efficiency and stronger privacy protection. (3) To solve the privacy protection problem that arises when mobile terminal devices conduct semantics-based search, we propose a privacy-preserving multi-keyword semantics-based search scheme. To meet users’ demand for low search latency, we deploy the search scheme on edge servers near the terminal devices. To minimize the computing overhead of the resource-constrained terminal device, only very simple text pre-processing and segmentation encryption operations need to be performed on it. The edge servers extract semantic information and build a semantic index using latent semantic analysis. They are also responsible for calculating and ranking semantic similarities according to the trapdoor submitted by the user. The scheme does not leak any private information about semantic feature attributes, such as feature values and feature vectors, during either the data processing phase or the search phase. It greatly reduces the storage and computation overhead on the terminal devices while making the search results better match the user’s query intent. Research on the key technologies of privacy-preserving text data processing under the data outsourcing model can meet users’ practical needs for both data privacy and data processing. It also provides significant theoretical and technical support for the development of data service outsourcing. |
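
To illustrate the two-server secret sharing idea in contribution (1), the following is a minimal sketch and not the scheme proposed in this work. It assumes standard two-party additive secret sharing over a prime modulus and uses a Beaver multiplication triple handed out by a trusted dealer; the abstract does not state which constructions are actually used. The names `MODULUS`, `share`, `reconstruct`, `add_shares`, and `beaver_multiply` are hypothetical.

```python
import secrets

# Hypothetical modulus chosen for illustration (a Mersenne prime).
MODULUS = 2**61 - 1


def share(value):
    """Split value into two additive shares, one per cloud server."""
    s0 = secrets.randbelow(MODULUS)
    s1 = (value - s0) % MODULUS
    return s0, s1


def reconstruct(s0, s1):
    """Recombine two shares into the original value."""
    return (s0 + s1) % MODULUS


def add_shares(x, y):
    """Secure addition: each server adds the shares it holds locally."""
    return (x[0] + y[0]) % MODULUS, (x[1] + y[1]) % MODULUS


def beaver_multiply(x, y):
    """Secure multiplication of shared x and y using a Beaver triple.

    Assumption: a trusted dealer supplies shares of random a, b and c = a*b.
    For brevity both servers are simulated inside one function.
    """
    a = secrets.randbelow(MODULUS)
    b = secrets.randbelow(MODULUS)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % MODULUS)

    # The servers exchange masked differences and open d = x - a, e = y - b.
    # These reveal nothing about x or y because a and b are uniformly random.
    d = reconstruct((x[0] - a_sh[0]) % MODULUS, (x[1] - a_sh[1]) % MODULUS)
    e = reconstruct((y[0] - b_sh[0]) % MODULUS, (y[1] - b_sh[1]) % MODULUS)

    # x*y = c + d*b + e*a + d*e; only server 0 adds the public d*e term.
    z0 = (c_sh[0] + d * b_sh[0] + e * a_sh[0] + d * e) % MODULUS
    z1 = (c_sh[1] + d * b_sh[1] + e * a_sh[1]) % MODULUS
    return z0, z1
```

Neither share alone reveals anything about the underlying value, which is why a single non-colluding server learns nothing from the data it stores. Addition needs no communication, while multiplication costs one round of exchange between the servers. Non-linear functions such as activations need extra protocol design on top of these primitives.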