
Research On BERT’s Knowledge Distillation And Sparsity Fine-tuning Model

Posted on: 2023-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: X L Shi
Full Text: PDF
GTID: 2558306911982199
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, pre-trained language models have developed rapidly in natural language processing. However, because of their enormous size, pre-trained language models are difficult to deploy on resource-constrained devices and in online services. Knowledge distillation, a common model compression method, trains a smaller student model to learn the language knowledge of the pre-trained model, reducing the model size while preserving performance. The traditional knowledge distillation strategy adopts one-to-one layer mapping, and the distillation process is very time-consuming: the student model is not only inefficient to produce but also struggles to learn all the knowledge of the original model. At the same time, a language model deployed in a production environment needs to be fine-tuned periodically, and the memory cost of fine-tuning is very high, which makes frequent model updates difficult and hurts timeliness. Aiming at the inefficiency of knowledge distillation and fine-tuning, this thesis improves the traditional knowledge distillation model by enhancing the distillation strategy and sparsifying the attention score matrix. The specific work includes the following two points.

(1) To address the slow distillation process of BERT-EMD, a BERT knowledge distillation model based on the dual EM (Earth Mover) distance is proposed. First, following optimal transport theory, the architecture of a BERT knowledge distillation model based on the dual EM distance is introduced. By treating the teacher layers and student layers as weight distributions, an optimal transport matrix is learned to represent the importance of different layers; through this matrix, each student layer can learn language knowledge from all teacher layers. Second, based on Kantorovich's theory, the formulation, constraints, and advantages of the EM distance are analyzed mathematically. Finally, a two-stage incremental-filling method is introduced to solve the dual EM distance, and the optimal transport matrix is obtained through two stages of modification and adjustment. Experimental results on the GLUE and TNews benchmarks show that, compared with BERT-EMD4, the proposed model reduces the average knowledge distillation time by 10%.
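To make the many-to-many layer mapping concrete, the sketch below builds a layer-wise cost matrix from teacher and student hidden states and computes a transport matrix between the two sets of layers. It is a minimal illustration under assumptions not stated in the abstract: equal hidden sizes, uniform layer weights, and a Sinkhorn (entropy-regularized) solver used as a stand-in for the exact dual-EMD solver with incremental filling described in the thesis; the function names are hypothetical.

    import numpy as np

    def layer_cost_matrix(teacher_hiddens, student_hiddens):
        # Mean-squared error between every (teacher layer, student layer) pair
        # of hidden states; both lists hold [seq_len, hidden] arrays, and the
        # hidden sizes are assumed equal for simplicity.
        cost = np.zeros((len(teacher_hiddens), len(student_hiddens)))
        for i, t in enumerate(teacher_hiddens):
            for j, s in enumerate(student_hiddens):
                cost[i, j] = np.mean((t - s) ** 2)
        return cost

    def transport_matrix(cost, teacher_weights, student_weights, reg=0.5, n_iters=200):
        # Entropy-regularized optimal transport (Sinkhorn iterations), used here
        # as a stand-in for the thesis's exact dual-EMD solver.
        K = np.exp(-cost / reg)                      # Gibbs kernel
        u = np.ones_like(teacher_weights)
        for _ in range(n_iters):
            v = student_weights / (K.T @ u)
            u = teacher_weights / (K @ v)
        return u[:, None] * K * v[None, :]           # transport matrix T

    # Toy usage: distil 12 teacher layers into 4 student layers with uniform weights.
    rng = np.random.default_rng(0)
    teacher = [rng.normal(size=(16, 768)) for _ in range(12)]
    student = [rng.normal(size=(16, 768)) for _ in range(4)]
    cost = layer_cost_matrix(teacher, student)
    T = transport_matrix(cost, np.full(12, 1 / 12), np.full(4, 1 / 4))
    layer_distillation_loss = float((T * cost).sum())

The quantity (T * cost).sum() is the transported distance; minimizing it during training lets each student layer draw on several teacher layers at once instead of a single fixed partner layer, which is the many-to-many mapping the dual EM distance is meant to enable.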
(2) To address the redundant attention scores, high memory usage, and low inference speed during BERT fine-tuning, a BERT fine-tuning model named Sparse ReLU-Self-Attention Fine-tuning (SRSAF), based on ReLU and root mean square normalization, is proposed. First, to address the blurred and inefficient attention scores produced by the softmax function, the ReLU function is used to truncate irrelevant information flow and generate exact-zero attention scores, which sparsifies the attention matrix and reduces the memory cost of fine-tuning. Next, to address the gradient instability of the ReLU function, root mean square normalization is introduced: the summed inputs are regularized according to their root mean square statistics, which keeps the model robust and saves computation during fine-tuning. Finally, the BertViz visualization tool is used to analyze the proposed model, and its sparsity is verified by visualizing different layers. Experimental results on the GLUE and TNews benchmarks show that, compared with BERT-EMD4, the proposed BERT sparse fine-tuning model improves performance by 35%, reduces the average fine-tuning time by 7%, and reduces GPU memory usage by 15%.
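As a concrete illustration of the sparsification idea, the following sketch replaces softmax with ReLU in a single self-attention head and applies RMS normalization to the block input. It is a minimal NumPy sketch under stated assumptions (single head, length-scaled ReLU scores, RMSNorm applied to the input), not the exact SRSAF formulation from the thesis; the helper names are hypothetical.

    import numpy as np

    def rms_norm(x, eps=1e-6):
        # Root mean square normalization: rescale by the RMS statistic only,
        # with no mean subtraction as in standard LayerNorm.
        return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

    def relu_self_attention(x, w_q, w_k, w_v):
        # Single-head self-attention with ReLU in place of softmax: negative
        # scores are truncated to exact zeros, so the attention matrix is sparse.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        attn = np.maximum(scores, 0.0) / x.shape[0]   # ReLU + length scaling (assumption)
        return attn @ v, attn

    # Toy usage: measure how many attention scores are exactly zero.
    rng = np.random.default_rng(0)
    seq_len, d_model = 8, 16
    x = rms_norm(rng.normal(size=(seq_len, d_model)))   # assumed RMSNorm placement
    w_q, w_k, w_v = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))
    out, attn = relu_self_attention(x, w_q, w_k, w_v)
    zero_fraction = float((attn == 0.0).mean())         # sparsity of the score matrix

Unlike softmax, which assigns every token a strictly positive weight, the exact zeros produced by ReLU can be skipped or stored sparsely, which is the source of the memory and speed savings reported above; the RMS normalization compensates for the unnormalized scores and helps keep gradients stable.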
Keywords/Search Tags:Pre-trained Language Model, Knowledge Distillation, Attention Mechanism, Optimal Transport, Sparse Matrix