
Tibetan Pre-Trained Model Based On ALBERT And Its Application

Posted on: 2021-03-17    Degree: Master    Type: Thesis
Country: China    Candidate: L Li    Full Text: PDF
GTID: 2415330611452116    Subject: Engineering · Software Engineering
Abstract/Summary:
In the field of natural language processing, a model can be pre-trained on unlabeled datasets and then fine-tuned on labeled datasets, which saves time and computing resources when training a neural network. With the help of pre-trained models, great breakthroughs have been made in many natural language processing tasks. Research on a Tibetan pre-trained model can not only compensate for the lack of labeled Tibetan datasets, but also promote research on Tibetan natural language processing. At present, research on Tibetan pre-trained models is still at an exploratory stage, yet it has important theoretical significance and wide application value for Tibetan natural language processing. To this end, this thesis carries out research on a Tibetan pre-trained model. The main research contents are as follows:

1. Since there is currently no public Tibetan dataset, this thesis crawls Tibetan corpus texts from the Tibet People's Website, the official website of Qinghai Tibetan Network Radio Station, and the Qinghai Provincial People's Government Website, and builds a training dataset for the pre-trained model together with the corpus provided by Professor Dora of Northwest Minzu University. It also collects data from the Chinese Tibetan Netcom to build a Tibetan text classification dataset and a Tibetan abstract extraction dataset (a crawling sketch follows this list).

2. To address the shortage of labeled Tibetan data for downstream tasks, this thesis trains a Tibetan ALBERT pre-trained model, reducing the need for labeled datasets. The pre-trained model reaches an accuracy of 74% on the masked language model task and 89% on the sentence-order prediction task (see the pre-training sketch below).

3. By comparing the ALBERT Tibetan text classification model with GBDT, Bi-LSTM, and TextCNN on text classification tasks, the effectiveness of the Tibetan ALBERT pre-trained model for text classification is verified. At the same time, to address sample imbalance, the ALBERT Tibetan text classification model is trained with the focal loss function; the results show that predictions for small-sample categories improve (see the focal loss sketch below).

4. The effectiveness of the Tibetan ALBERT pre-trained model on downstream tasks is further verified through a comparison experiment on Tibetan abstract extraction (see the extraction sketch below).
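As a rough illustration of the corpus collection in content 1, the sketch below crawls one article page and appends its paragraphs to a corpus file. The URL, the CSS selector, and the file name are placeholders for illustration only, not the actual structure of the sites used in the thesis.

    # Hypothetical crawler sketch: the URL and the CSS selector are placeholders.
    import requests
    from bs4 import BeautifulSoup

    def fetch_tibetan_paragraphs(url):
        """Download one article page and return its text paragraphs."""
        resp = requests.get(url, timeout=10)
        resp.encoding = "utf-8"                      # Tibetan pages are typically UTF-8
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumed selector for the article body; adjust per site.
        return [p.get_text(strip=True) for p in soup.select("div.article p")]

    paragraphs = fetch_tibetan_paragraphs("https://example.com/article.html")
    with open("tibetan_corpus.txt", "a", encoding="utf-8") as f:
        f.write("\n".join(paragraphs) + "\n")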
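For the pre-training in content 2, the following is a minimal sketch of how a Tibetan ALBERT can be trained jointly on the masked language model (MLM) and sentence-order prediction (SOP) objectives with the Hugging Face transformers library. The configuration, vocabulary size, mask token id, and toy batch are assumptions, not the thesis's actual setup.

    import torch
    from transformers import AlbertConfig, AlbertForPreTraining

    # Assumed ALBERT-base-style configuration; ALBERT shares parameters across
    # layers and factorizes the embedding matrix, which keeps the model small
    # enough for a low-resource language like Tibetan.
    config = AlbertConfig(vocab_size=30000, embedding_size=128,
                          hidden_size=768, num_attention_heads=12,
                          intermediate_size=3072)
    model = AlbertForPreTraining(config)

    # Toy batch: `labels` holds the original token ids at masked positions and
    # -100 elsewhere; `sentence_order_label` is 0 for the correct segment order
    # and 1 for swapped segments (the SOP task).
    input_ids = torch.randint(5, 30000, (2, 16))
    labels = torch.full_like(input_ids, -100)
    labels[:, 3] = input_ids[:, 3]        # pretend position 3 was masked
    input_ids[:, 3] = 4                   # 4 = assumed [MASK] token id
    sop = torch.tensor([0, 1])

    out = model(input_ids=input_ids, labels=labels, sentence_order_label=sop)
    out.loss.backward()                   # joint MLM + SOP loss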
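The focal loss used in content 3 down-weights well-classified examples so that rare categories contribute more to the gradient. A minimal multi-class sketch, assuming logits from the ALBERT classifier and gamma = 2; the batch size and the number of categories are made up for the example.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        """Multi-class focal loss: scales cross-entropy by (1 - p_t)^gamma."""
        log_p = F.log_softmax(logits, dim=-1)
        log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()                 # probability of the true class
        return (-(1.0 - pt) ** gamma * log_pt).mean()

    # Usage with classifier logits (e.g. from AlbertForSequenceClassification).
    logits = torch.randn(8, 12, requires_grad=True)   # 8 examples, 12 classes
    targets = torch.randint(0, 12, (8,))
    focal_loss(logits, targets).backward()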
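For the abstract extraction task in content 4, one simple way to use a pre-trained encoder is to embed each sentence and the whole document, then keep the sentences most similar to the document vector. The sketch below illustrates this idea only; it is an assumed approach, not necessarily the method of the thesis, and in practice the trained Tibetan checkpoint from content 2 would replace the freshly initialized model.

    import torch
    from transformers import AlbertConfig, AlbertModel

    # Freshly initialized model as a stand-in for the Tibetan checkpoint.
    config = AlbertConfig(vocab_size=30000, hidden_size=768,
                          num_attention_heads=12, intermediate_size=3072)
    model = AlbertModel(config).eval()

    def embed(input_ids):
        with torch.no_grad():
            return model(input_ids=input_ids).pooler_output

    # Toy token ids standing in for tokenized Tibetan sentences.
    sentences = [torch.randint(5, 30000, (1, 12)) for _ in range(5)]
    doc = torch.cat(sentences, dim=1)[:, :512]     # truncated full document
    scores = [torch.cosine_similarity(embed(s), embed(doc)).item()
              for s in sentences]
    top2 = sorted(range(5), key=lambda i: -scores[i])[:2]   # extracted sentences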
Keywords/Search Tags:Tibetan, pre-training, ALBERT, text classification, abstract extraction