In recent years, pre-trained language models have been widely used in the field of natural language processing. Under the pre-training and fine-tuning paradigm, a language model is first trained on large amounts of unlabeled data and then fine-tuned on labeled data to fit downstream tasks. This approach greatly reduces the need for labeled datasets and saves substantial time and computational resources on downstream tasks, and pre-trained language models have achieved a number of significant results in natural language processing. Research on Tibetan pre-training techniques can not only alleviate the scarcity of labeled Tibetan datasets, but also further the development of Tibetan natural language processing. However, Tibetan lacks open-source large-scale corpora and pre-trained language models, and existing models have not been validated on Tibetan text classification or Tibetan text summarization experiments. In addition, BERT is a deep bidirectional pre-trained language model built on the Transformer; after fine-tuning, it can achieve state-of-the-art performance on a variety of tasks. However, BERT has hundreds of millions of parameters and long training and inference times, which limits its application in scenarios where computational resources are insufficient. Recent research has therefore used knowledge distillation to compress BERT into smaller, lighter models with faster inference, although a reduced model size usually also implies reduced performance.

To address these issues, this thesis constructs a large-scale Tibetan text dataset, trains a Tibetan BERT pre-trained language model on it, and validates the model on a Tibetan text classification task and a Tibetan text summarization task. The thesis analyzes the structure of the BERT pre-trained language model together with the characteristics of the Tibetan language, integrates knowledge distillation to build a compact Tibetan pre-trained language model, and verifies the effectiveness of that model on the Tibetan text summarization and Tibetan text classification tasks, comparing it with the Tibetan BERT pre-trained language model trained without knowledge distillation.

The study of Tibetan pre-trained language modeling in this thesis mainly includes the following. First, starting from the Tibetan dataset constructed by the TNLP team of Tibet University, the data were expanded by crawling Tibetan websites, then filtered and normalized to produce a Tibetan pre-training dataset, a Tibetan text classification dataset, and a Tibetan text summarization dataset. Second, a Tibetan BERT_mini and a Tibetan BERT_base pre-trained language model were trained; the Tibetan BERT_mini model reaches 70% accuracy on the MLM task and the Tibetan BERT_base model reaches 81%. Third, Tibetan BERT pre-training based on knowledge distillation was carried out, with the attention-distillation loss L_attn of the trained model reaching 2.74. Finally, Tibetan text classification and Tibetan text summarization experiments were conducted: in the classification experiments, the Tibetan BERT_base model achieves 97% accuracy and the knowledge-distilled Tibetan BERT model achieves 87%; in the summarization experiments, the Tibetan BERT_base model is able to summarize the main content of a text. Research on Tibetan pre-trained language models promotes the development of Tibetan natural language processing and has important theoretical significance and broad application value.
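The abstract does not reproduce the training code, but the MLM pre-training step described above follows a standard recipe. The sketch below assumes a Hugging Face Transformers setup, a pre-built Tibetan WordPiece tokenizer, and a placeholder corpus file tibetan_corpus.txt; all names and hyperparameters are illustrative, not the thesis's actual configuration.

```python
# Minimal sketch of BERT masked-language-model (MLM) pre-training with Hugging Face
# Transformers. File names, tokenizer name, and model sizes are placeholders, not
# the thesis's actual configuration.
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("tibetan-bert-tokenizer")  # assumed local tokenizer

# BERT_mini-style configuration (4 layers, hidden size 256); BERT_base would use 12 layers / 768.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4, intermediate_size=1024)
model = BertForMaskedLM(config)

# Tokenize the raw Tibetan corpus (one document per line) into fixed-length inputs.
raw = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, which defines the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tibetan-bert-mini", num_train_epochs=1,
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=train_set,
)
trainer.train()
```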
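The reported L_attn value suggests an attention-transfer term in the distillation objective. The exact loss is not given here; a TinyBERT-style sketch, in which the student's self-attention maps are pulled toward the teacher's with a mean-squared error under a uniform layer mapping (both assumptions), could look as follows.

```python
# Rough sketch of an attention-transfer distillation loss (TinyBERT-style):
# MSE between student and teacher self-attention matrices. The uniform layer
# mapping and the MSE formulation are assumptions, not the thesis's exact recipe.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attentions, teacher_attentions):
    """Both arguments are tuples of per-layer tensors of shape
    (batch, heads, seq_len, seq_len), as returned by BERT with output_attentions=True."""
    n_student = len(student_attentions)
    n_teacher = len(teacher_attentions)
    stride = n_teacher // n_student            # e.g. map 12 teacher layers onto 4 student layers
    loss = 0.0
    for i, s_att in enumerate(student_attentions):
        t_att = teacher_attentions[(i + 1) * stride - 1]
        # Average over attention heads so the loss is defined even when head counts differ.
        loss = loss + F.mse_loss(s_att.mean(dim=1), t_att.mean(dim=1))
    return loss / n_student
```

During distillation pre-training, a term of this kind would typically be added to the student's masked-language-modeling loss rather than used on its own.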
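For the downstream classification experiments, the usual approach is to attach a classification head to the pre-trained encoder and fine-tune it on the labeled set. A minimal sketch follows, assuming the pre-trained Tibetan model was saved locally under tibetan-bert-base, that the labeled data is stored as CSV files with text and label columns, and a placeholder number of classes.

```python
# Minimal fine-tuning sketch for Tibetan text classification. The model directory,
# CSV file names, column names, and num_labels are assumptions for illustration.
import numpy as np
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("tibetan-bert-base")
model = AutoModelForSequenceClassification.from_pretrained("tibetan-bert-base", num_labels=12)

data = load_dataset("csv", data_files={"train": "tnc_train.csv", "test": "tnc_test.csv"})
def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
data = data.map(encode, batched=True)

def compute_metrics(eval_pred):
    # Classification accuracy, as reported in the experiments above.
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tibetan-bert-cls", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```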