In recent years, pre-trained language models have been widely used in the field of natural language processing. Under the pre-training and fine-tuning paradigm, a language model is first trained on large amounts of unlabeled data and then fine-tuned on labeled data to fit downstream tasks. This approach greatly reduces the need for labeled datasets and saves substantial time and computational resources on downstream tasks, and pre-trained language models have achieved a number of significant results in natural language processing. Research on Tibetan pre-training techniques can not only alleviate the scarcity of labeled Tibetan datasets, but also further the development of Tibetan natural language processing. However, Tibetan lacks open-source large-scale corpora and pre-trained language models, and existing models have not been validated on Tibetan text classification or Tibetan text summarization experiments. In addition, BERT is a deep bidirectional pre-trained language model built on the Transformer; after fine-tuning, it can achieve state-of-the-art performance on a variety of tasks. However, BERT has hundreds of millions of parameters and long training and inference times, which limits its application in scenarios where computational resources are insufficient. Recent research has therefore used knowledge distillation to compress BERT into smaller, lighter models with faster inference, although a reduced model size usually also implies reduced performance.

To address these issues, this thesis constructs a large-scale Tibetan text dataset, trains a Tibetan BERT pre-trained language model on it, and validates the model on a Tibetan text classification task and a Tibetan text summarization task. The thesis analyzes the structure of the BERT pre-trained language model together with the characteristics of the Tibetan language, integrates knowledge distillation to build a compact Tibetan pre-trained language model, and verifies the effectiveness of that model on the Tibetan text summarization and Tibetan text classification tasks, comparing it with the Tibetan BERT pre-trained language model trained without knowledge distillation.

The study of Tibetan pre-trained language modeling in this thesis mainly includes the following. First, starting from the Tibetan dataset constructed by the TNLP team of Tibet University, the data were expanded by crawling Tibetan websites, then filtered and normalized to produce a Tibetan pre-training dataset, a Tibetan text classification dataset, and a Tibetan text summarization dataset. Second, a Tibetan BERT_mini and a Tibetan BERT_base pre-trained language model were trained; the Tibetan BERT_mini model reaches 70% accuracy on the MLM task and the Tibetan BERT_base model reaches 81%. Third, Tibetan BERT pre-training based on knowledge distillation was carried out, with the attention-distillation loss L_attn of the trained model reaching 2.74. Finally, Tibetan text classification and Tibetan text summarization experiments were conducted: in the classification experiments, the Tibetan BERT_base model achieves 97% accuracy and the knowledge-distilled Tibetan BERT model achieves 87%; in the summarization experiments, the Tibetan BERT_base model is able to summarize the main content of a text. Research on Tibetan pre-trained language models promotes the development of Tibetan natural language processing and has important theoretical significance and broad application value.
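The abstract does not reproduce the training code, but the MLM pre-training step described above follows a standard recipe. The sketch below assumes a Hugging Face Transformers setup, a pre-built Tibetan WordPiece tokenizer, and a placeholder corpus file tibetan_corpus.txt; all names and hyperparameters are illustrative, not the thesis's actual configuration.

```python
# Minimal sketch of BERT masked-language-model (MLM) pre-training with Hugging Face
# Transformers. File names, tokenizer name, and model sizes are placeholders, not
# the thesis's actual configuration.
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("tibetan-bert-tokenizer")  # assumed local tokenizer

# BERT_mini-style configuration (4 layers, hidden size 256); BERT_base would use 12 layers / 768.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                    num_hidden_layers=4, num_attention_heads=4, intermediate_size=1024)
model = BertForMaskedLM(config)

# Tokenize the raw Tibetan corpus (one document per line) into fixed-length inputs.
raw = load_dataset("text", data_files={"train": "tibetan_corpus.txt"})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, which defines the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tibetan-bert-mini", num_train_epochs=1,
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=train_set,
)
trainer.train()
```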
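The reported L_attn value suggests an attention-transfer term in the distillation objective. The exact loss is not given here; a TinyBERT-style sketch, in which the student's self-attention maps are pulled toward the teacher's with a mean-squared error under a uniform layer mapping (both assumptions), could look as follows.

```python
# Rough sketch of an attention-transfer distillation loss (TinyBERT-style):
# MSE between student and teacher self-attention matrices. The uniform layer
# mapping and the MSE formulation are assumptions, not the thesis's exact recipe.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attentions, teacher_attentions):
    """Both arguments are tuples of per-layer tensors of shape
    (batch, heads, seq_len, seq_len), as returned by BERT with output_attentions=True."""
    n_student = len(student_attentions)
    n_teacher = len(teacher_attentions)
    stride = n_teacher // n_student            # e.g. map 12 teacher layers onto 4 student layers
    loss = 0.0
    for i, s_att in enumerate(student_attentions):
        t_att = teacher_attentions[(i + 1) * stride - 1]
        # Average over attention heads so the loss is defined even when head counts differ.
        loss = loss + F.mse_loss(s_att.mean(dim=1), t_att.mean(dim=1))
    return loss / n_student
```

During distillation pre-training, a term of this kind would typically be added to the student's masked-language-modeling loss rather than used on its own.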
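For the downstream classification experiments, the usual approach is to attach a classification head to the pre-trained encoder and fine-tune it on the labeled set. A minimal sketch follows, assuming the pre-trained Tibetan model was saved locally under tibetan-bert-base, that the labeled data is stored as CSV files with text and label columns, and a placeholder number of classes.

```python
# Minimal fine-tuning sketch for Tibetan text classification. The model directory,
# CSV file names, column names, and num_labels are assumptions for illustration.
import numpy as np
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("tibetan-bert-base")
model = AutoModelForSequenceClassification.from_pretrained("tibetan-bert-base", num_labels=12)

data = load_dataset("csv", data_files={"train": "tnc_train.csv", "test": "tnc_test.csv"})
def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
data = data.map(encode, batched=True)

def compute_metrics(eval_pred):
    # Classification accuracy, as reported in the experiments above.
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tibetan-bert-cls", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```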