
Research on Automatic Notation of Word for Tibetan Corpus Based on HMM

Posted on: 2011-01-27  Degree: Master  Type: Thesis
Country: China  Candidate: J F Su  Full Text: PDF
GTID: 2195330332969984  Subject: Linguistics and Applied Linguistics
Abstract/Summary:
In recent years, corpus linguistics has developed rapidly and opened a new path for language research. The construction of English and Chinese corpora, together with the word frequency statistics drawn from them, has laid a solid foundation for the quantitative study of minority languages at different levels and offers experience to draw on. The development of Tibetan language information processing technology and the achievements of Tibetan research have created the conditions for building a Tibetan corpus and compiling word frequency statistics.

Tibetan part-of-speech tagging is a fundamental problem in Tibetan information processing. On the one hand, its results can be integrated directly into information extraction, information retrieval, machine translation, and many other practical applications; on the other hand, automatic part-of-speech tagging is a necessary front-end tool for a Tibetan chunk (language block) recognizer, a Tibetan syntactic parser, and a Tibetan semantic parser. Research on and implementation of a Tibetan part-of-speech tagger therefore has both theoretical significance and practical value.

Part-of-speech tagging methods fall into two major categories: rule-based and statistics-based. The statistics-based approach has gradually become a research focus because it does not require hand-written linguistic rules and achieves a high rate of correct tagging. The Hidden Markov Model (HMM) is one of the most important models among the statistics-based methods.

This thesis studies statistics-based part-of-speech tagging and implements a Tibetan part-of-speech tagging system. The system collects statistics from the training corpus under a Hidden Markov Model, obtains the part-of-speech and lexical probability information it needs, and tags text with the Viterbi algorithm. To address the data sparseness caused by the small Tibetan training corpus, it applies a simple but efficient "addition" data smoothing algorithm, which effectively avoids the drop in tagging accuracy that data sparseness would otherwise cause.

This experimental work is a tentative study of the automatic processing of a Tibetan corpus. It shows that automatic part-of-speech annotation of a Tibetan corpus can be realized with an HMM. In the closed test, the system achieved a correct rate of 88% to 90%.
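
The pipeline the abstract describes (counting on a tagged corpus, additive smoothing, Viterbi decoding) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `train_hmm` and `viterbi`, the smoothing constant `delta=1.0`, the `<s>` start marker, the single extra vocabulary slot for unseen words, and the English toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_hmm(tagged_sentences, delta=1.0):
    """Estimate add-delta smoothed log-probabilities from tagged sentences.

    delta is an assumed smoothing constant; the abstract does not state one.
    """
    tags = sorted({t for sent in tagged_sentences for _, t in sent})
    vocab = {w for sent in tagged_sentences for w, _ in sent}
    trans = Counter()       # (previous tag, tag) counts; "<s>" marks sentence start
    emit = Counter()        # (tag, word) counts
    prev_total = Counter()  # how often each tag (or "<s>") precedes another tag
    tag_total = Counter()   # how often each tag occurs
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_total[prev] += 1
            emit[(tag, word)] += 1
            tag_total[tag] += 1
            prev = tag
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot shared by all unseen words

    def log_trans(prev, tag):
        # Additive smoothing: unseen tag pairs still get non-zero probability.
        return math.log((trans[(prev, tag)] + delta)
                        / (prev_total[prev] + delta * n_tags))

    def log_emit(tag, word):
        # Additive smoothing: unseen (tag, word) pairs still get non-zero probability.
        return math.log((emit[(tag, word)] + delta)
                        / (tag_total[tag] + delta * n_words))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    # Each column maps tag -> (best log-probability ending in that tag, backpointer).
    best = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_trans(p, t), p) for p in tags)
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)
    # Start from the best final tag and follow the backpointers.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy English corpus, purely illustrative; it is not data from the thesis.
corpus = [
    [("dog", "N"), ("runs", "V")],
    [("cat", "N"), ("sleeps", "V")],
    [("dog", "N"), ("sleeps", "V")],
]
tags, log_trans, log_emit = train_hmm(corpus)
print(viterbi(["cat", "runs"], tags, log_trans, log_emit))  # ['N', 'V']
```

Working in log space avoids numeric underflow on long sentences. Because the counts are smoothed, a word absent from the training corpus (for example `"bird"` here) still receives a small emission probability under every tag, so decoding does not collapse to zero probability.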
Keywords/Search Tags: Tibetan, corpus, dictionary, part of speech tagging, Hidden Markov Model