| Bilingual parallel corpora play an important role in natural language processing tasks such as multilingual and cross-language information processing. In recent years, with the development of digital humanities research and the implementation of the "Chinese culture going global" strategy, bilingual parallel corpora of classical literature, as a main carrier of cultural communication, have provided underlying data support for cross-language classics retrieval systems and cross-language humanities computing studies. A bilingual parallel corpus with sentence-level alignment provides effective ordering information, and the quality of sentence alignment greatly influences research on cross-language retrieval system construction and knowledge extraction. Sentence alignment means that the bilingual text is semantically matched at the sentence level. Besides the simplest pattern, it involves a variety of alignment patterns, which makes automatic sentence alignment complicated. The special characteristics of the Chinese-English bilingual texts of Pre-Qin literature make it more difficult still. Taking the construction of a Chinese-English sentence-level parallel corpus of Pre-Qin literature and cross-language retrieval as its background, this paper aims to achieve automatic alignment of ancient Chinese and English sentences in Pre-Qin literature. The work covers the following aspects: 1. Construction of a paragraph-level bilingual corpus. Online bilingual texts were acquired manually, and a paragraph-aligned parallel corpus of bilingual classics was constructed in a semi-automated manner. Based on this paragraph-aligned corpus, the text was segmented into sentences and aligned manually, yielding 13,700 aligned bilingual sentence pairs. 2. Sentence alignment method selection and feature 
extraction. This paper uses a method that combines length and vocabulary, and introduces classification ideas into the study of bilingual sentence alignment. By analyzing the linguistic features and syntactic structure of ancient Chinese and English, and drawing on previous studies, four features of bilingual sentence pairs are extracted: sentence length features, alignment pattern features, punctuation features, and keyword translation features. The "Analects of Confucius" and the "Book of Rites" were selected from the sentence-aligned corpus, comprising 5,941 aligned sentence pairs. From these, a set of 36,728 candidate bilingual sentence pairs was generated as the experimental corpus. Using the extracted features, a model was trained on the bilingual corpus and a statistical score was assigned to each candidate sentence pair; assuming that the probabilities of the sentence pairs are independent, the alignment with the maximum probability is computed. 3. Construction of the bilingual alignment model for Pre-Qin literature. First, based on the manually extracted features, supervised learning experiments were carried out from the perspectives of "sequence labeling" and "overall classification". In the sequence labeling experiment, the LSTM-CRF neural network model performed best, with the highest F value reaching 92.67%. On this basis, a feature fusion experiment was carried out, and finally an effective method for ancient Chinese-English sentence alignment is proposed. Then a method that requires no manual feature extraction or calculation was explored: based on bilingual semantic features acquired automatically by Doc2vec, it worked well with the LSTM model. |
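
The scoring step described in aspect 2 (a score per candidate sentence pair, independence between pairs, and selection of the maximum-probability alignment) could be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the pattern set `PATTERNS`, the constant `EXPECTED_RATIO`, the toy length-based scorer `pair_probability`, and the function `align` are all hypothetical names introduced here, and a trained classifier over the four features would take the place of the toy scorer.

```python
import math

# Assumed alignment patterns: (number of Chinese sentences, number of English sentences).
PATTERNS = [(1, 1), (1, 2), (2, 1)]

# Assumed average number of English words per classical Chinese character.
EXPECTED_RATIO = 1.5


def pair_probability(zh, en):
    """Toy stand-in for a trained scorer: uses the sentence length feature only."""
    zh_len = max(sum(len(s) for s in zh), 1)          # length in characters
    en_len = max(sum(len(s.split()) for s in en), 1)  # length in words
    ratio = en_len / zh_len
    # Equals 1.0 when the ratio matches the expectation and decays as it deviates.
    return math.exp(-abs(math.log(ratio / EXPECTED_RATIO)))


def align(zh_sents, en_sents, score=pair_probability):
    """Return the alignment that maximizes the product of pair probabilities.

    Because the pairs are assumed independent, the product becomes a sum of
    log-probabilities, which dynamic programming can maximize.
    """
    n, m = len(zh_sents), len(en_sents)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == -math.inf:
                continue  # this prefix cannot be covered by the patterns
            for di, dj in PATTERNS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                p = score(zh_sents[i:ni], en_sents[j:nj])
                cand = best[i][j] + math.log(max(p, 1e-12))
                if cand > best[ni][nj]:
                    best[ni][nj] = cand
                    back[ni][nj] = (i, j)
    if best[n][m] == -math.inf:
        raise ValueError("no alignment covers both texts with the given patterns")
    # Walk the back-pointers from the end to recover the aligned index groups.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    path.reverse()
    return path
```

Calling `align(zh_sents, en_sents)` returns a list of `(chinese_indices, english_indices)` groups covering a paragraph. Working in log space keeps the product of many small probabilities numerically stable, and restricting the transitions to `PATTERNS` is what encodes the alignment pattern feature in this sketch.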