The Research Of Extraction Of Chinese-Uyghur Parallel Texts From Web

Posted on: 2012-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: J F Liang
Full Text: PDF
GTID: 2178330335486134
Subject: Computer application technology
Abstract/Summary:
A Chinese-Uyghur bilingual parallel corpus is an important resource for developing Chinese-Uyghur statistical machine translation systems, but the existing corpora cannot meet practical needs: they are small, out of date, and poorly balanced across domains. To improve this situation, this thesis studies web page downloading, page denoising, and parallel text recognition. The main results are as follows:

First, a web page downloading tool was implemented to meet the needs of the research. The tool downloads pages breadth-first, which is better suited to downloading large sites and also makes breakpoint recovery, incremental downloading, and similar functions easier to implement.

Second, the links in each page are classified and processed. The source page is then divided using an algorithm based on page structure and statistics, and noise is removed according to text length and text density. This method further improves the efficiency and effectiveness of page denoising.

Third, candidate Chinese-Uyghur parallel text pairs are selected based on the digits that appear in both the Chinese text and the Uyghur text and on the ratio of the lengths of the two texts. SVM classifiers then identify the parallel text pairs among these candidates, using the degree of noun translation and the ratio of word counts as features.

Finally, a system that automatically obtains Chinese-Uyghur parallel texts from the web was built, with the aim of improving the state of Chinese-Uyghur bilingual corpora.
Keywords/Search Tags: Downloading, Web denoising, Parallel text recognition
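
The candidate-selection step described in the abstract (shared digits plus a length ratio) could be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: the function names, the use of Jaccard overlap as the digit score, the restriction to ASCII digits, and all threshold values are assumptions chosen for the example.

```python
import re

# Assumption: only ASCII digit runs are compared; the thesis does not say
# how digits are normalised across the two scripts.
DIGIT_RUN = re.compile(r"[0-9]+")


def digit_tokens(text):
    """Return the set of digit runs (e.g. '2012', '28') found in text."""
    return set(DIGIT_RUN.findall(text))


def digit_overlap(zh_text, ug_text):
    """Jaccard overlap of the digit runs in both texts (0.0 if either has none)."""
    zh_digits = digit_tokens(zh_text)
    ug_digits = digit_tokens(ug_text)
    if not zh_digits or not ug_digits:
        return 0.0
    return len(zh_digits & ug_digits) / len(zh_digits | ug_digits)


def is_candidate(zh_text, ug_text, min_overlap=0.5, min_ratio=0.5, max_ratio=4.0):
    """Decide whether a (Chinese, Uyghur) text pair is a candidate parallel pair.

    The thresholds are hypothetical placeholders; in practice they would be
    tuned on known parallel pairs.
    """
    if not zh_text or not ug_text:
        return False
    # Character-length ratio, Uyghur over Chinese.
    ratio = len(ug_text) / len(zh_text)
    if not (min_ratio <= ratio <= max_ratio):
        return False
    return digit_overlap(zh_text, ug_text) >= min_overlap
```

Pairs that pass a cheap filter of this kind would then go to the SVM classifier mentioned in the abstract, so the filter only needs to discard obviously unrelated pairs rather than make the final decision.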