Applying Web Data Mining To The Parallel Corpus: The Automatic Identification And Alignment Of The Corresponding Units

Posted on:2013-07-31

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Han

Full Text:PDF

GTID:2235330374460414

Subject:Foreign Linguistics and Applied Linguistics

Abstract/Summary:

In the construction of parallel corpora, the most common and mature software for the automatic recognition and alignment of corresponding units works mainly at the paragraph and sentence levels. Software that recognizes and aligns corresponding units at the level of multi-word sequences (referred to in this thesis as word-sequence corresponding units) is rarely found, which greatly limits the speed and scale at which this type of parallel corpus can be built. To change this situation, the ultimate goal of this research is to design and develop software for the automatic recognition and alignment of word-sequence corresponding units (referred to in this thesis as CURecognizer).

Guided by the theories of the meaning unit, the translation unit, and the corresponding unit, this research uses web data mining technology to recognize and align corresponding units automatically in English-Chinese parallel texts, starting from the automatic identification of noun sequences in the English text.

The research object is the automatic recognition and alignment of noun corresponding units in an English-Chinese parallel corpus of China's political news. Political news reports on the China Daily website (www.ChinaDaily.com.cn) were downloaded and extracted in real time by using and further developing web data mining technology. A reference corpus constructed automatically by software assists the automatic identification and judgement of English noun phrases, and grammar rules are combined with probability statistics as the guiding approach. On this basis, software for the automatic recognition of noun sequences in English text, built on POS tagging, was designed and developed (referred to in this thesis as NSRecognizer). With the help of the online translation tools of Google and Bing, a list of Chinese translations is obtained for each English noun phrase. This list is then used to retrieve and match the corresponding Chinese word sequence within a defined scope of the Chinese text. The scope is obtained automatically from a formula in the software that is built from the number of sentences in the English and Chinese texts and the position of the English noun phrase in the English text. In this way, noun corresponding units are recognized automatically in the English-Chinese texts and then aligned automatically in two modes: an external visual mode based on color and an internal data mode based on a database.

Because of the limits of CLAWS tagging accuracy and of the richness of online translation results, the performance of the CURecognizer developed in this research is not yet ideal. Nevertheless, the research shows that applying web data mining technology to the study and development of corpora is a new approach in the development of corpus linguistics.
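The abstract gives neither the noun-sequence rules nor the scope formula, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes that noun tokens carry CLAWS-style tags beginning with "N", that a noun sequence is a maximal run of such tokens, and that the Chinese search scope can be approximated by mapping the English sentence position proportionally onto the Chinese sentences with a small margin. The names `noun_sequences`, `search_window`, and `margin` are hypothetical and do not come from the thesis.

```python
from typing import List, Tuple


def is_noun_tag(tag: str) -> bool:
    # Assumption: CLAWS-style noun tags (e.g. NN1, NN2, NP1) start with "N".
    return tag.upper().startswith("N")


def noun_sequences(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal runs of consecutive noun-tagged tokens as strings."""
    sequences: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if is_noun_tag(tag):
            current.append(word)
        else:
            if current:
                sequences.append(" ".join(current))
            current = []
    if current:
        sequences.append(" ".join(current))
    return sequences


def search_window(en_index: int, en_total: int, zh_total: int,
                  margin: int = 1) -> Tuple[int, int]:
    """Map a 0-based English sentence index to an inclusive range of
    0-based Chinese sentence indices.

    Assumption: a simple proportional mapping plus a fixed margin; the
    formula actually used by the software is not given in the abstract.
    """
    if en_total <= 0 or zh_total <= 0:
        raise ValueError("sentence counts must be positive")
    center = round(en_index * zh_total / en_total)
    low = max(0, center - margin)
    high = min(zh_total - 1, center + margin)
    return low, high


if __name__ == "__main__":
    tagged = [("the", "AT"), ("State", "NN1"), ("Council", "NN1"),
              ("issued", "VVD"), ("a", "AT1"), ("policy", "NN1"),
              ("statement", "NN1")]
    print(noun_sequences(tagged))
    print(search_window(2, 10, 12))
```

Each extracted noun sequence would then be sent to an online translation service, and the returned candidate translations searched for only inside the sentence range given by `search_window`, which keeps the matching step from scanning the whole Chinese text.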

Keywords/Search Tags:

Corresponding units, Noun sequences, Parallel corpus, Automatic alignment, Web data mining

Related items

1	Research On Data Mining Of Massive Minority Cultural Resources Based On Spark
2	Research And Implementation Of Automatic Labeling System For Quasi Written Language Korean Speech Corpus
3	Corresponding Units In Chinese-English Parallel Texts--Corpus-driven Approach
4	A Computational Research On The Unit Of Translation For Automatic Bitext Alignment
5	Term Translation Pair Alignment Based On A Bilingual Parallel Corpus Of Chinese Historical Classics
6	Parallel Processing On Parallel Corpus Of Chinese-English
7	A Study On Construction Principle And Application Of Chinese-English Parallel Translation Corpus
8	Chinese And English Parallel Corpus Sentence Alignment Of Pre-Qin Literature Based On Multiple Models
9	Analysis And Research To The Data Of CET-4 Score Based On Data Mining
10	Sentence Level Alignment In The English-Chinese Parallel Corpora And The Application In Machine Translation Studies