Chinese And English Parallel Corpus Sentence Alignment Of Pre-Qin Literature Based On Multiple Models

Posted on:2020-12-21

Degree:Master

Type:Thesis

Country:China

Candidate:J W Liang

Full Text:PDF

GTID:2505306314995909

Subject:Information Science

Abstract/Summary:


Bilingual parallel corpora play an important role in natural language processing tasks such as multilingual and cross-language information processing. In recent years, with the development of digital humanities research and the implementation of the "Chinese culture going global" strategy, bilingual parallel corpora of classical literature, as a main carrier of cultural communication, have provided underlying data support for cross-language classics retrieval systems and cross-language humanities computing research. A bilingual parallel corpus with sentence-level alignment provides effective ordering information, and the quality of sentence alignment has a great influence on research into cross-language retrieval system construction and knowledge extraction.

Sentence alignment means that a bilingual text has been semantically matched at the sentence level. Beyond the simplest one-to-one pattern, it involves a variety of alignment patterns, which makes automatic sentence alignment more complicated. The task becomes more difficult still because of the special characteristics of the Chinese-English bilingual texts of Pre-Qin literature. Against the background of constructing a Chinese-English sentence-level parallel corpus of Pre-Qin literature and supporting cross-language retrieval, this thesis aims to achieve automatic alignment of ancient Chinese and English sentences in Pre-Qin literature. The work focuses on the following aspects:

1. Construction of a paragraph-level bilingual corpus

Online bilingual texts are collected manually, and a parallel corpus of paragraph-aligned bilingual classics is constructed in a semi-automated manner. Based on this paragraph-aligned corpus, the texts are split into sentences and aligned manually, yielding 13,700 aligned bilingual sentence pairs.

2. Sentence alignment method selection and feature extraction

This thesis uses a method that combines length and vocabulary, and introduces classification ideas into the study of bilingual sentence alignment. By analyzing the linguistic features and syntactic structure of ancient Chinese and English, and drawing on previous studies, four features of bilingual sentence pairs are extracted: sentence length features, alignment pattern features, punctuation features and keyword translation features. The "Analects of Confucius" and the "Book of Rites" are selected from the sentence-aligned corpus, comprising 5,941 aligned sentence pairs. A set of candidate sentence pairs containing 36,728 bilingual sentence pairs is generated from them as the experimental corpus. Using the extracted features, a model is trained on the bilingual corpus and a statistical score is assigned to each candidate sentence pair; assuming that the probabilities of the sentence pairs are independent, the alignment with the maximum probability is computed.

3. Construction of the Pre-Qin literature bilingual alignment model

First, based on the manually extracted features, supervised learning experiments are carried out from the perspectives of "sequence labeling" and "overall classification". In the sequence labeling experiments, the LSTM-CRF neural network model performs best, with the highest F value reaching 92.67%. On this basis, a feature fusion experiment is carried out, and an effective method for ancient Chinese-English sentence alignment is finally proposed. A method that requires no manual feature extraction or calculation is then explored: based on the bilingual semantic features automatically acquired by Doc2vec, it works well with the LSTM model.
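The scoring step in aspect 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the punctuation sets, the restriction to 1-1, 1-2 and 2-1 patterns, and the dynamic-programming search are all assumptions made for illustration. The `score` argument stands in for whatever trained model assigns a probability to a candidate pair; under the independence assumption, maximizing the product of pair probabilities is the same as maximizing the sum of their logarithms.

```python
import math

# Alignment patterns tried at each step: (Chinese sentences, English sentences).
# Restricting to 1-1, 1-2 and 2-1 is an assumption of this sketch.
PATTERNS = [(1, 1), (1, 2), (2, 1)]
ZH_PUNCT = "，。；：？！、"   # hypothetical punctuation inventories
EN_PUNCT = ",.;:?!"


def pair_features(zh, en, lexicon):
    """Compute four illustrative features for one candidate pair.

    zh and en are lists of sentences; lexicon is a hypothetical
    {Chinese keyword: English translation} dictionary.
    """
    zh_text = "".join(zh)
    en_text = " ".join(en).lower()
    en_words = en_text.split()
    # Keywords present in the Chinese side, and how many have their
    # translation on the English side.
    hits = [k for k in lexicon if k in zh_text]
    matched = sum(1 for k in hits if lexicon[k].lower() in en_text)
    zh_punct = sum(c in ZH_PUNCT for c in zh_text)
    en_punct = sum(c in EN_PUNCT for c in en_text)
    return {
        "length_ratio": len(zh_text) / max(len(en_words), 1),
        "pattern": (len(zh), len(en)),
        "punct_diff": abs(zh_punct - en_punct),
        "keyword_hit": matched / len(hits) if hits else 0.0,
    }


def align(zh_sents, en_sents, score):
    """Find the monotone alignment with the highest total log-probability.

    score(zh_group, en_group) must return a probability in (0, 1].
    Returns a list of (zh_start, zh_end, en_start, en_end) spans,
    or [] when no sequence of PATTERNS covers both texts.
    """
    n, m = len(zh_sents), len(en_sents)
    best = {(0, 0): (0.0, None)}  # state -> (log-probability, previous state)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            base = best[(i, j)][0]
            for di, dj in PATTERNS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                p = score(zh_sents[i:ni], en_sents[j:nj])
                total = base + math.log(max(p, 1e-12))
                if (ni, nj) not in best or total > best[(ni, nj)][0]:
                    best[(ni, nj)] = (total, (i, j))
    if (n, m) not in best:
        return []
    # Walk back from the final state to recover the chosen spans.
    path, node = [], (n, m)
    while best[node][1] is not None:
        prev = best[node][1]
        path.append((prev[0], node[0], prev[1], node[1]))
        node = prev
    return path[::-1]
```

In practice `score` would wrap a classifier trained on feature vectors such as those returned by `pair_features`; any callable that maps a pair of sentence groups to a probability can be passed in.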

Keywords/Search Tags:

Multilingual information processing, Cross-language information processing, Sentence alignment, Pre-Qin Literature, Chinese-English parallel corpus


Related items

1	Parallel Processing On Parallel Corpus Of Chinese-English
2	The Study Of The Information Processing Characteristics In The Obsessive-Compulsive Tendency Individuals
3	Interaction Between Visual Information And Textual Information-Based On Second Language Embodied Language Processing
4	A Study On Chinese Mongolian Word Alignment And The Related Technologies
5	Syntactic Or Semantic Information Advanced:an Investigation Into The Mechanism Of Pre-intermediate Chinese EFL Learners In Sentence Processing
6	Processing Specificity Of Threat Action Information: Behavior And ERPs
7	The Pre-activation Of Native Language Phonological Information During L2 Spoken Sentence Processing
8	The Influence Of Different Proficiency In The Second Sentence Processing On The Categorical Cognition Of Verbs
9	Information Processing Of Developmental Dyslexia In Chinese Children
10	Information Processing Strategies Of Political Interviews