With the changes of the times and the popularization of standardized modern Chinese, ancient literature is no longer easy to read, yet these ancient books carry the wisdom and spirit condensed by the Chinese nation over thousands of years. To revitalize the historical documents written in classical Chinese, high-quality, large-scale translation of ancient Chinese is crucial. Human translation is one way to organize ancient books, but its cost is extremely high. Meanwhile, the rapid development of deep learning in recent years has provided technical and theoretical support for machine translation, making large-scale translation and digitization of classical Chinese possible. Nevertheless, current machine translation models still face problems. For example, the lack of high-quality parallel corpora keeps translation quality from being ideal, and existing models often consider only sentence-level translation while ignoring discourse information. These problems have slowed research on machine translation of classical Chinese.

To promote the digitization of classical Chinese ancient books, we have constructed a batch of high-quality sentence-level and document-level parallel corpora. This paper also proposes a new, concise document-level processing method and validates it on several classical neural network models and pre-trained models. The main contributions can be summarized in the following three points:

1) Construction of a Zizhitongjian parallel corpus with context, and a document-level processing method based on adjacent clauses. We constructed sentence-level, multi-resolution document-level, and adjacent-clause document-level parallel corpora. To make better use of discourse context, this paper proposes a concise document-level processing method that attaches adjacent clauses (a sketch of one plausible implementation follows this list). Experimental results show that this processing method helps the model better understand contextual text.

2) Classical-to-modern Chinese translation research with Seq2Seq (BiLSTM) and Transformer. Based on these common end-to-end machine translation models, this paper explores the impact of corpus size, word segmentation, diachronic changes in ancient Chinese, and text style on translation performance. The results show that the current training corpus is not large enough to train a translation model with stable performance, so this paper turns to pre-trained models. Moreover, since these experiments were trained on sentence-level parallel corpora without context, and contextual information certainly affects translation, the subsequent experiments use document-level processed corpora.

3) Combination of Guwen-UniLM with document-level corpora. This paper is the first to combine the Guwen-UniLM pre-trained model with a document-level corpus. Compared with the traditional Seq2Seq (BiLSTM) and Transformer models, as well as pre-trained models not specific to classical Chinese (BERT, RoBERTa), it achieves the best translation results on the Zizhitongjian corpus.
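To make the adjacent-clause processing in contribution 1 concrete, here is a minimal sketch, not the paper's actual pipeline: it assumes the context is a single preceding classical clause joined to the source side with a [SEP]-style separator, while the target remains the modern-Chinese rendering of the current clause only; the function name, separator token, and placeholder data are illustrative.

```python
from typing import List, Tuple

def build_adjacent_clause_pairs(
    pairs: List[Tuple[str, str]],  # (classical, modern) pairs in document order
    sep: str = " [SEP] ",          # hypothetical separator token
) -> List[Tuple[str, str]]:
    """Prepend each clause's adjacent (preceding) clause as source-side context."""
    examples = []
    for i, (src, tgt) in enumerate(pairs):
        if i > 0:
            # Attach the preceding clause so the encoder sees local context.
            src = pairs[i - 1][0] + sep + src
        examples.append((src, tgt))
    return examples

if __name__ == "__main__":
    doc = [("clause A (classical)", "clause A (modern)"),
           ("clause B (classical)", "clause B (modern)")]
    for s, t in build_adjacent_clause_pairs(doc):
        print(s, "=>", t)
```

Keeping the target at the current clause means the extra context only conditions the encoder, so the document-level examples stay aligned one-to-one with the sentence-level corpus.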
To sum up, this paper first uses traditional end-to-end machine translation models to explore the translation of classical Chinese into modern Chinese. The results indicate two problems: the corpus is too small, and it does not take contextual information into account. In response to these two problems, this paper conducts experiments with pre-trained models and proposes a method that combines document-level corpora with the Guwen-UniLM pre-trained model. The experimental results show that this method improves translation quality.
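As a rough illustration of the "pre-trained model + document-level corpus" recipe summarized above, the sketch below assembles an encoder-decoder from a BERT-style classical-Chinese checkpoint and decodes one context-augmented input. Guwen-UniLM itself is a UniLM-style single-stack model that stock Hugging Face transformers does not ship, so this EncoderDecoderModel stand-in and the checkpoint path are assumptions, not the paper's implementation.

```python
# Illustrative stand-in, not the paper's Guwen-UniLM code: transformers has no
# UniLM class, so a BERT-initialized encoder-decoder approximates the recipe.
from transformers import AutoTokenizer, EncoderDecoderModel

CKPT = "path/to/classical-chinese-bert"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(CKPT, CKPT)

# Required so generate() knows how to start and pad BERT-style decoding.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Document-level source: preceding clause + separator + current clause.
src = "preceding classical clause [SEP] current classical clause"
batch = tokenizer(src, return_tensors="pt")
out = model.generate(batch["input_ids"],
                     attention_mask=batch["attention_mask"],
                     max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In practice the model would first be fine-tuned on the document-level pairs before decoding; the point here is only how the pre-trained weights and the context-augmented inputs fit together.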