Text Classification Model Based On Distributed Machine Learning

Posted on: 2024-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: X C Sheng
Full Text: PDF
GTID: 2558307136495334
Subject: Computer technology
Abstract/Summary:
Text classification is one of the key technologies for making efficient use of massive text data. Training data may contain sensitive information that must be protected from leakage or improper use, so data privacy needs to be taken seriously in text classification tasks. In this thesis, we propose a text classification method based on federated learning and differential privacy to protect the privacy of training data. To improve the training efficiency of the server's initialized model, we study a distributed text classification method based on Spark. Finally, we implement a hybrid-mode distributed text classification system that applies the proposed methods in practice. The main contributions of this thesis are as follows:

(1) We propose a distributed text classification method based on Spark, which combines the distributed computing power of Spark with the text representation learning ability of the BERT model. The method addresses the low efficiency of training the server-side initialized model on large-scale news data, while keeping the accuracy of distributed learning comparable to that of centralized learning. Experiments show that the proposed method outperforms other word embedding methods on text classification tasks run in a distributed Spark environment. Compared with centralized learning, it reduces computing time by 59.53%, significantly improving training efficiency.

(2) We propose a differentially private SGD algorithm that combines differential privacy with the federated learning framework, yielding a differentially private federated BERT model for text classification. The algorithm includes a privacy budget calculation method that tracks the privacy loss in detail. The method protects the federated learning process against inference attacks during the transmission of training parameters and prevents parameter information and features from being exposed. We also explore the impact of different parameters on the efficiency of the algorithm. Experimental results show that the proposed method achieves a model accuracy of 64.8% while protecting privacy.

(3) We design and implement a hybrid-mode distributed text classification system. The system combines Spark and federated learning to perform large-scale text classification in a distributed environment while protecting the privacy of training data. Functional and performance testing shows that the system meets the functional and real-time requirements of distributed text classification. In the text prediction scenario, classification accuracy reaches 86.3%, demonstrating good practical applicability.
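
Contribution (2) combines differentially private SGD with federated learning. The abstract does not give the thesis's algorithm or its privacy budget calculation, so the following is only a minimal sketch of the generic technique under stated assumptions: per-example gradients are clipped to an L2 bound, Gaussian noise scaled to that bound is added to their sum, and a server averages the client weights with equal weighting. All function names and parameters (`clip_norm`, `noise_multiplier`, and so on) are hypothetical, plain Python lists stand in for BERT parameters, and privacy accounting is omitted.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Scale grad down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update on a client (illustrative, not the thesis's code).

    Each per-example gradient is clipped, the clipped gradients are summed,
    Gaussian noise with std noise_multiplier * clip_norm is added to the sum,
    and the noisy sum is averaged over the batch before the descent step.
    """
    dim = len(weights)
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


def federated_average(client_weights):
    """Server-side aggregation: unweighted mean of the client weight vectors."""
    n_clients = len(client_weights)
    return [sum(ws) / n_clients for ws in zip(*client_weights)]


if __name__ == "__main__":
    rng = random.Random(0)
    global_weights = [0.0, 0.0]
    # Two hypothetical clients, each holding two per-example gradients.
    client_grads = [
        [[3.0, 4.0], [0.5, -0.5]],
        [[-1.0, 2.0], [0.0, 1.0]],
    ]
    updated = [
        dp_sgd_step(global_weights, grads, lr=0.1, clip_norm=1.0,
                    noise_multiplier=1.1, rng=rng)
        for grads in client_grads
    ]
    print(federated_average(updated))
```

In this sketch only the noisy, clipped updates leave each client, which is the property that limits what an observer of the transmitted parameters can infer. A larger `noise_multiplier` strengthens that protection at the cost of accuracy, which is consistent with the accuracy trade-off the abstract reports.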
Keywords/Search Tags:Text classification, Distributed machine learning, Spark, Federated learning, Differential privacy, Data privacy protection