| In the construction of parallel corpora, the most common and mature software for the automatic recognition and alignment of corresponding units works mainly at the paragraph and sentence level. Software that recognizes and aligns corresponding units at the level of multi-word sequences (referred to in this article as word sequence corresponding units) is rare, which greatly limits the speed and scale at which this type of parallel corpus can be built. To change this situation, the ultimate goal of this research is the design and development of software for the automatic recognition and alignment of word sequence corresponding units (referred to in this article as CURecognizer). This research is guided by the theories of the meaning unit, the translation unit, and the corresponding unit, and uses web data mining technology to recognize and align corresponding units automatically in English-Chinese corresponding texts, starting from the automatic identification of noun sequences in the English text. Taking the automatic recognition and alignment of noun corresponding units in an English-Chinese parallel corpus of China's political news as the research object, political news reports from the China Daily website (www.ChinaDaily.com.cn) were downloaded and extracted in real time by applying and extending web data mining technology. Using an automatically constructed reference corpus to assist the identification and judgment of English noun phrases, and combining grammar rules with probability statistics as the guiding approach, software for the automatic recognition of noun sequences in English text based on POS tagging (referred to in this article as NSRecongnizer) was designed and developed.
With the help of the online translation tools of Google and Bing, a list of Chinese translations is obtained for each English noun phrase. Using this list, the corresponding Chinese word sequence is retrieved and matched within a defined scope of the Chinese text. This scope is obtained automatically from a formula built into the software, which is constructed from the number of sentences in the English and Chinese corresponding texts and the position of the English noun phrase in the English text. In this way, noun corresponding units are recognized automatically in the English-Chinese text and are then aligned automatically in two modes: an external visual mode based on color and an internal data mode based on a database. Limited by the accuracy of CLAWS tagging and the coverage of online translation, the performance of the CURecognizer developed in this research is not yet ideal. However, the research shows that applying web data mining technology to the study and development of corpora is a new approach in the development of corpus linguistics. |
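
The recognition and matching steps described above can be illustrated with a minimal Python sketch. It is not the CURecognizer or NSRecongnizer implementation. All function names are hypothetical. Treating every POS tag that begins with "N" as a noun is an assumption. The proportional mapping with a one-sentence margin is also an assumption, since the abstract does not give the actual scope formula. The translation candidates are hard-coded rather than fetched from an online translation tool.

```python
from typing import List, Optional, Tuple


def noun_sequences(tagged: List[Tuple[str, str]]) -> List[str]:
    """Group maximal runs of consecutive noun-tagged tokens into sequences."""
    runs: List[str] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag.startswith("N"):  # assumption: noun tags begin with "N"
            current.append(word)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


def search_window(en_index: int, en_total: int, zh_total: int,
                  margin: int = 1) -> Tuple[int, int]:
    """Map a 0-based English sentence index to an inclusive range of
    Chinese sentence indices (assumed proportional mapping, not the
    formula used by the original software)."""
    if en_total <= 0 or zh_total <= 0:
        raise ValueError("both texts must contain at least one sentence")
    if not 0 <= en_index < en_total:
        raise ValueError("en_index out of range")
    center = (en_index * zh_total) // en_total
    start = max(0, center - margin)
    end = min(zh_total - 1, center + margin)
    return start, end


def find_translation(candidates: List[str], zh_sentences: List[str],
                     window: Tuple[int, int]) -> Optional[Tuple[int, str]]:
    """Return (sentence index, candidate) for the first candidate
    translation found inside the window, or None if nothing matches."""
    start, end = window
    for i in range(start, end + 1):
        for cand in candidates:
            if cand and cand in zh_sentences[i]:
                return i, cand
    return None


# Toy example with hand-written tags and candidate translations.
tagged = [("the", "AT"), ("State", "NN1"), ("Council", "NN1"),
          ("issued", "VVD"), ("a", "AT1"), ("policy", "NN1"),
          ("statement", "NN1")]
zh_sentences = ["国务院发布了一份政策声明。", "该声明于周一公布。"]

phrases = noun_sequences(tagged)
window = search_window(en_index=0, en_total=2, zh_total=len(zh_sentences))
match = find_translation(["国务院"], zh_sentences, window)
```

Restricting the search to a window around the expected position keeps a short candidate translation from matching an unrelated occurrence elsewhere in the Chinese text. The cost is that a match is missed when the two texts diverge by more than the margin.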