
Research On Tibetan Named Entity Recognition Based On Pre-training

Posted on: 2024-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Chen
Full Text: PDF
GTID: 2555307079992479
Subject: Electronic Information · Computer Technology (Professional Degree)
Abstract/Summary:
The development of Tibetan natural language processing technology is key to advancing Tibetan informatization. Named entity recognition is one of the basic tasks of Tibetan natural language processing, and its performance directly affects downstream tasks. At present, most research on Tibetan named entity recognition is based on traditional deep learning models, which rely heavily on high-quality supervised data sets. With the development and application of the pre-training paradigm, pre-trained models have achieved state-of-the-art performance on many natural language processing tasks. Pre-training uses a large amount of unsupervised corpus to train a language model, whose parameters are then fine-tuned with a small amount of supervised data in the downstream task, thereby improving downstream performance. To explore the effectiveness of the pre-training method for Tibetan named entity recognition, this thesis studies Tibetan pre-trained language models. The main contributions are as follows:

1. To address the lack of a published high-quality pre-training corpus and supervised Tibetan named entity recognition data set, this thesis builds on the Tibetan corpus provided by the State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, supplemented by various Tibetan text resources, and constructs a high-quality pre-training corpus and a supervised Tibetan named entity recognition data set of a certain scale.

2. Published Tibetan pre-trained models apply the SentencePiece tool directly to obtain their segmentation model and vocabulary, which produces abnormal subwords that begin with non-placeholder Tibetan characters or split syllables. To solve this problem, this thesis proposes a subword algorithm based on Tibetan Characters And Syllables (TiCAS).

3. To apply the TiCAS subword algorithm, this thesis replaces the original WordPiece tokenizer of BERT and of ELECTRA (an improved, classical self-encoding language model with excellent performance on natural language understanding tasks) with the SentencePiece tokenizer. The modified models are named spBERT and spELECTRA respectively.

4. To explore the effectiveness of the pre-training method on Tibetan named entity recognition, four Tibetan pre-trained language models are trained. Owing to differences in the models' code implementations, the TiCAS subword algorithm is integrated into only three of them (spBERT, spELECTRA and ALBERT), while the fourth, TiRoBERTa, is trained on the original RoBERTa model.

5. By comparing, on the Tibetan named entity recognition task, the four Tibetan pre-trained models trained in this thesis against two published Tibetan pre-trained models and three traditional deep learning models as baselines, the effectiveness of the pre-training method and the TiCAS subword algorithm is verified.
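Point 2 above motivates subword units that respect Tibetan syllable boundaries. The abstract does not spell out the TiCAS algorithm itself, but Tibetan syllables are delimited by the tsheg mark (U+0F0B), so a minimal, hypothetical sketch of the syllable-first segmentation that a character-and-syllable scheme would start from (the helper name `segment_syllables` is our own) might look like:

```python
# Hedged sketch: split Tibetan text into syllables at tsheg marks
# (U+0F0B). A character-and-syllable subword scheme such as TiCAS
# would operate on these units rather than on raw byte or character
# pieces, avoiding subwords that cut across syllable boundaries.

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark ་

def segment_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables, dropping empty pieces."""
    return [s for s in text.split(TSHEG) if s]

# "bod skad" (the Tibetan language) written as two syllables:
print(segment_syllables("\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51"))
# → ['བོད', 'སྐད']
```

A real segmenter would also handle the shad (U+0F0D) and other punctuation; this sketch only illustrates the syllable unit the thesis's subword algorithm is built around.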
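The named entity recognition task in points 4 and 5 is typically framed as token classification. The abstract does not specify the thesis's tag set, but a common convention is BIO tagging; a small sketch of decoding BIO tags (our own illustrative helper, not the thesis's code) into entity spans:

```python
# Hedged sketch: decode a BIO tag sequence (B- begins an entity,
# I- continues it, O is outside) into (entity_type, start, end)
# spans with end exclusive. Stray I- tags that do not continue the
# current entity are treated as closing it, one common convention.

def decode_bio(tags: list[str]) -> list[tuple[str, int, int]]:
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != etype
        ):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
        # an I- tag matching the current entity type needs no action
    return spans

print(decode_bio(["B-PER", "I-PER", "O", "B-LOC"]))
# → [('PER', 0, 2), ('LOC', 3, 4)]
```

Span-level precision, recall and F1 over such decoded spans are the usual metrics for the kind of model comparison described in point 5.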
Keywords: Tibetan, Pre-training, Named Entity Recognition, BERT, ELECTRA