
Research on Automatic Text Segmentation and Word Segmentation for Ancient Chinese Texts

Posted on: 2021-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wei
GTID: 2415330620468583
Subject: Computer technology
Abstract/Summary:
Because of the significant differences between the two languages, models cannot simply be borrowed from analogous modern Chinese tasks for ancient Chinese research, and the methods and models currently adopted by researchers have consequently fallen behind. With the rapid advancement of deep learning, and in particular the emergence of models pre-trained on massive amounts of text, deep models have brought large improvements to many modern Chinese natural language processing tasks. To bring ancient Chinese research up to date, we first collected and pre-processed ancient Chinese texts from the Internet, nearly four hundred million characters in total, and pre-trained a BERT model for ancient Chinese ourselves. To our knowledge, this is the first attempt to introduce a pre-trained language model into ancient Chinese research. We then apply this BERT model to the two most basic and crucial tasks for ancient Chinese: automatic text segmentation (with punctuation) and word segmentation.

The text segmentation and punctuation task aims to transform an unbroken string of characters into readable sentences by inserting separation or punctuation marks. Current methods, whether rule-based systems, statistical machine learning, or deep learning, fail to generalize well because of the insufficient amount of training data. Word segmentation is similar to text segmentation in definition, but the amount of hand-labelled data is very limited and existing research is confined to narrow domains. Applying those approaches to the many styles of text produced over China's three-thousand-year written history would be extremely difficult, if not downright impossible.

On the automatic text segmentation and punctuation tasks, BERT with fine-tuning outperforms both the existing bidirectional GRU model and our baseline model (BiLSTM + CRF) by a significant margin, achieving state-of-the-art results, and it also shows excellent generalization ability. Because ancient Chinese texts lack explicit sentence and paragraph boundaries, real inputs may be thousands of characters long, in contrast to the sequences used in our evaluation. To put the model to practical use, we design a text segmentation method based on a sliding window, which places no limit on the length of the input sequence.

On automatic word segmentation for ancient Chinese, we are the first to adopt unsupervised learning. By combining nonparametric Bayesian models with the deep neural language model BERT, our approach copes with the scarcity of hand-labelled data: with only a limited amount of hand-labelled data added, its performance is on par with that of existing models trained on comparatively large datasets. We propose Multi-Stage Iterative Training (MSIT) for unsupervised word segmentation, which achieves an F1 score of 90.81% on the Zuozhuan (an ancient Chinese history book) dataset. After adding only 500 ground-truth sentences, which makes the setting weakly supervised, the F1 score reaches 95.55%. When using the same training set as the existing literature (one hundred and fifty thousand characters in total), our method reaches an F1 score of 97.40%, the state-of-the-art result. Experiments on texts of different styles and periods further demonstrate a generalization ability that cannot be achieved by supervised methods alone.
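The thesis does not reproduce implementation details, but the fine-tuning setup the abstract describes, a pre-trained BERT with a character-level classification head deciding which punctuation mark, if any, follows each character, is commonly built as below. This is a minimal inference-time sketch using the Hugging Face transformers library; the checkpoint path, the three-label scheme, and the punctuation inventory are placeholder assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: segmentation/punctuation as character-level token
# classification. "path/to/ancient-chinese-bert" is a placeholder for the
# thesis's own pre-trained checkpoint; the 3-label scheme is assumed.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

LABELS = ["O", "COMMA", "PERIOD"]     # no mark / 、 / 。 after the character
MARKS = {"COMMA": "、", "PERIOD": "。"}

tokenizer = BertTokenizerFast.from_pretrained("path/to/ancient-chinese-bert")
model = BertForTokenClassification.from_pretrained(
    "path/to/ancient-chinese-bert", num_labels=len(LABELS))

def punctuate(chars: str) -> str:
    """Insert a predicted punctuation mark (or nothing) after each character."""
    enc = tokenizer(list(chars), is_split_into_words=True,
                    return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**enc).logits[0]   # (seq_len, num_labels)
    out, prev = [], None
    for pos, wid in enumerate(enc.word_ids(0)):
        if wid is None or wid == prev:    # skip [CLS]/[SEP] and subword pieces
            continue
        prev = wid
        out.append(chars[wid])
        out.append(MARKS.get(LABELS[int(logits[pos].argmax())], ""))
    return "".join(out)
```

In practice the model would first be fine-tuned on punctuated ancient texts with the usual cross-entropy token-classification loss; the sketch shows inference only.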
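The abstract states that a sliding-window scheme removes the input-length limit at inference time but does not specify the window policy. One plausible design, assumed here, runs the fixed-length tagger over overlapping windows and trusts only each window's central predictions, chosen so the kept regions tile the full input exactly.

```python
# Hedged sketch of sliding-window inference over arbitrarily long inputs.
# The thesis's exact window/stride policy is not specified; this version
# keeps each window's central region and discards the overlapped edges.
from typing import Callable, List

def sliding_window_tags(chars: str,
                        tag_window: Callable[[str], List[str]],
                        window: int = 510,
                        overlap: int = 64) -> List[str]:
    """Tag a long character string with a fixed-length per-character tagger.

    tag_window(text) must return exactly one label per character of text.
    """
    assert window > 2 * overlap, "stride must stay positive"
    if len(chars) <= window:
        return tag_window(chars)
    stride = window - 2 * overlap
    tags: List[str] = []
    start = 0
    while start < len(chars):
        chunk = chars[start:start + window]
        chunk_tags = tag_window(chunk)
        # Keep only the trusted central region, except at the two ends.
        lo = 0 if start == 0 else overlap
        hi = len(chunk) if start + window >= len(chars) else len(chunk) - overlap
        tags.extend(chunk_tags[lo:hi])
        if start + window >= len(chars):
            break
        start += stride
    return tags
```

Wrapping a per-character tagger such as the fine-tuned punctuation model above in this function yields predictions for texts of unbounded length, matching the abstract's claim of no limit on the input sequence.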
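For the unsupervised word segmenter, the abstract names nonparametric Bayesian models combined with BERT under MSIT, without giving the model itself. A standard nonparametric Bayesian baseline for this problem is a Dirichlet-process unigram model sampled with Gibbs moves over boundary variables, in the style of Goldwater et al.; the sketch below shows only that classical component. ALPHA, P_STOP, and the iteration count are illustrative assumptions, and the BERT coupling and multi-stage iteration of MSIT are not reproduced.

```python
# Hedged sketch: Dirichlet-process unigram word segmentation via Gibbs
# sampling over boundary indicators. This is the classical component only;
# the thesis's MSIT additionally couples such a model with BERT.
import random
from collections import Counter

ALPHA = 20.0    # DP concentration parameter (assumed)
P_STOP = 0.5    # geometric word-length parameter of the base measure

def p0(word: str, n_chartypes: int) -> float:
    """Base measure: uniform over characters, geometric over word length."""
    p = P_STOP / (1.0 - P_STOP)
    for _ in word:
        p *= (1.0 - P_STOP) / n_chartypes
    return p

def gibbs_segment(chars: str, iters: int = 200, seed: int = 0) -> list:
    rng = random.Random(seed)
    v = len(set(chars))
    # b[i] is True iff a word boundary follows chars[i]; the end is fixed.
    b = [rng.random() < 0.5 for _ in range(len(chars) - 1)] + [True]

    def words() -> list:
        out, start = [], 0
        for i, flag in enumerate(b):
            if flag:
                out.append(chars[start:i + 1])
                start = i + 1
        return out

    counts = Counter(words())
    for _ in range(iters):
        for i in range(len(chars) - 1):
            # find the word(s) spanning the gap after position i
            left = i
            while left > 0 and not b[left - 1]:
                left -= 1
            right = i + 1
            while not b[right]:
                right += 1
            w1, w2 = chars[left:i + 1], chars[i + 1:right + 1]
            w12 = w1 + w2
            # remove the affected word(s) from the CRP counts
            if b[i]:
                counts[w1] -= 1
                counts[w2] -= 1
            else:
                counts[w12] -= 1
            counts += Counter()            # discard zero entries
            n = sum(counts.values())
            # posterior predictive odds of "join" vs "split"
            p_join = (counts[w12] + ALPHA * p0(w12, v)) / (n + ALPHA)
            p_w1 = (counts[w1] + ALPHA * p0(w1, v)) / (n + ALPHA)
            p_w2 = (counts[w2] + (w1 == w2) + ALPHA * p0(w2, v)) / (n + 1 + ALPHA)
            p_split = p_w1 * p_w2
            b[i] = rng.random() < p_split / (p_split + p_join)
            if b[i]:
                counts[w1] += 1
                counts[w2] += 1
            else:
                counts[w12] += 1
    return words()
```

Calling gibbs_segment on a raw character string returns one candidate segmentation; a run at the scale reported in the abstract would use a full corpus, many more iterations, and the BERT-guided iterative stages that MSIT adds on top.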
Keywords/Search Tags: Automatic text segmentation, Word segmentation for ancient Chinese texts, BERT, Nonparametric Bayesian models, Weakly supervised learning