| Natural language processing (NLP) is a significant field of computer science and artificial intelligence that studies theories and methods for effective natural-language communication between people and computers. Machine translation is a branch of NLP research, and it presupposes a large-scale corpus. Research on Chinese-English bilingual corpora that include domain-specific unknown words is lacking, which leaves machine translation unprofessional and unbalanced in specialized domains; this motivates the research in this dissertation. The goal of this dissertation is to build a bilingual sentence alignment system that refines segment-aligned text into sentence-aligned text. The work is divided into three parts. First, we designed an evaluation function for sentence alignment, a length-based sentence alignment algorithm, and a search algorithm for the optimal sequence of sentence alignments. We downloaded bilingual pages from a bilingual website, China Text (CNKI), parsed them, removed the page tags, which carry no useful content, retained the bilingual text, and built a segment-aligned bilingual corpus. We also kept the keywords of the bilingual abstracts provided on the website. Second, we extracted dictionaries from the translation software StarDict, analyzed their original format, and converted them into a custom format for the bilingual sentence alignment system. We then added the English-Chinese keyword pairs extracted in the previous step to the dictionary, which enlarges the vocabulary and increases its coverage of professional terms. Finally, we applied stemming to English words, which reduces the complexity of processing them and improves system efficiency. We implemented the bilingual sentence alignment system and ran comparative experiments with varied parameters to test its performance. |
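
The length-based alignment and optimal-sequence search described in the first part can be illustrated with a small dynamic-programming sketch. This is not the dissertation's actual evaluation function: the bead types and penalties in `BEADS`, the `length_cost` formula, and the `ratio` parameter (expected target length per unit of source length) are all assumptions chosen for illustration, and sentence length is measured simply in characters.

```python
from typing import List, Tuple

# Hypothetical bead types (source count, target count) with assumed penalties.
# A 1-1 bead is free; insertions, deletions and merges cost extra.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int, ratio: float) -> float:
    """Assumed cost: relative gap between expected and actual target length."""
    expected = src_len * ratio
    return abs(expected - tgt_len) / max(expected, tgt_len, 1)


def align(src: List[str], tgt: List[str],
          ratio: float = 1.0) -> List[Tuple[List[str], List[str]]]:
    """Return the lowest-cost sequence of beads covering both sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state by each bead type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(english_sentences, chinese_sentences, ratio=r)` on one aligned segment returns a list of `(source_sentences, target_sentences)` pairs. In a fuller system the length cost would presumably be combined with dictionary matches on stemmed English words, as the later parts of the abstract suggest.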