The Research Of The Uyghur Sentence Word Clustering And Chinese-Uyghur Word Alignment

Posted on: 2013-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Tan
GTID: 2218330374966456
Subject: Computer application technology
Abstract/Summary:
Word clustering and word alignment are essential problems in cross-lingual natural language processing. Many applications built on bilingual corpora, such as SBMT, EBMT, WSD, dictionary compilation, and bilingual teaching, depend on word clustering and word-level alignment.

Uyghur, a Turkic language of the Altaic family, is written in an Arabic-based script. Most new words in modern Uyghur are formed by adding affixes to existing words, so both their forms and their meanings remain closely related to the original words.

Word clustering is a fundamental problem in natural language processing, yet few studies have addressed it for Uyghur so far. This dissertation analyzes the structural characteristics of Uyghur words and proposes two similarity calculation methods, one based on word length and the other on the morphological features of words.

The work focuses on these two similarity measures, with the aim of improving the accuracy and recall of word clustering. Both are used with the K-Means algorithm. The word-length method uses the Kansai method to compute the similarity of words, while the morphological method removes affixes and computes similarity from the Uyghur characters that the words share. In comparison, the latter performs better. Building on the word clustering results, the dissertation further improves the GIZA++ word alignment training process to raise its accuracy.
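The affix-removal idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the words are given in Latin transliteration rather than Arabic script, the suffix list is a small hand-picked sample, the similarity is a simple shared-prefix ratio, and a greedy threshold grouping stands in for K-Means to keep the example short.

```python
# Hypothetical sample of Uyghur suffixes in Latin transliteration
# (an illustrative assumption, not the affix inventory used in the thesis).
SUFFIXES = ("lar", "ler", "ning", "ni", "da", "de")


def strip_suffixes(word, suffixes=SUFFIXES, min_stem=3):
    """Repeatedly remove the longest matching suffix while the stem stays
    at least min_stem characters long."""
    changed = True
    while changed:
        changed = False
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                word = word[: -len(suf)]
                changed = True
                break
    return word


def stem_similarity(a, b):
    """Length of the shared prefix of the two stems, divided by the
    length of the longer stem."""
    sa, sb = strip_suffixes(a), strip_suffixes(b)
    common = 0
    for ca, cb in zip(sa, sb):
        if ca != cb:
            break
        common += 1
    longest = max(len(sa), len(sb))
    return common / longest if longest else 0.0


def cluster(words, threshold=0.8):
    """Greedy grouping (a stand-in for K-Means): each word joins the first
    cluster whose first member is similar enough, else starts a new one."""
    clusters = []
    for w in words:
        for c in clusters:
            if stem_similarity(w, c[0]) >= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters


words = ["kitab", "kitablar", "kitabni", "mektep", "mektepler", "mektepning"]
print(cluster(words))
```

With this sample input, the inflected forms of each word end up in the same group because they reduce to the same stem once the suffixes are removed. A real system would need a proper affix inventory and a clustering step that matches the method described above.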
Keywords/Search Tags: Word length characteristics, Morphological characteristics, K-Means clustering, GIZA++