Research On Automatic Notation Of Word For Tibetan Corpus Based On HMM

Posted on:2011-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:J F Su

Full Text:PDF

GTID:2195330332969984

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

In recent years, corpus linguistics has developed rapidly, opening a new path for language research. The construction of English and Chinese corpora, together with their word frequency statistics, has laid a reliable foundation for quantitative study at different levels and offers experience that research on minority languages can draw on. Advances in Tibetan information processing technology and the achievements of Tibetan research have created the conditions for building Tibetan corpora and compiling word frequency statistics.

Tibetan part of speech tagging is a fundamental problem in Tibetan information processing. On the one hand, its results can be integrated directly into information extraction, information retrieval, machine translation and many other practical applications; on the other hand, automatic part of speech tagging is a necessary front-end tool for a Tibetan chunk recognizer, a Tibetan syntactic parser and a Tibetan semantic parser. The research and implementation of a Tibetan part of speech tagger therefore has both theoretical significance and practical value.

Part of speech tagging methods fall into two major categories: rule-based and statistics-based. Statistics-based methods have gradually become a research focus because they do not require hand-written linguistic rules and achieve high tagging accuracy. The Hidden Markov Model (HMM) is one of the most important models among the statistics-based methods.

This thesis studies statistics-based part of speech tagging and implements a Tibetan part of speech tagging system. The system collects statistics from a training corpus under a Hidden Markov Model, derives the part of speech and lexical probability information it needs, and tags text with the Viterbi algorithm. To address the data sparseness caused by the small Tibetan training corpus, it applies a simple but efficient additive ("addition") smoothing algorithm, which avoids the drop in tagging accuracy that data sparseness would otherwise cause.

This experimental work is a tentative study of the automatic processing of Tibetan corpora. It shows that automatic part of speech annotation of a Tibetan corpus can be achieved with an HMM. In a closed test, the system reached an accuracy of 88% to 90%.
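
The pipeline described in the abstract (counting on a tagged corpus, additive smoothing, Viterbi decoding) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names `train_hmm` and `viterbi`, the smoothing constant `delta`, the start symbol `<s>`, and the toy English corpus with the tags N and V are all assumptions chosen for the example.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol


def train_hmm(tagged_sentences, delta=1.0):
    """Count transitions and emissions, then return additively smoothed log-probabilities."""
    tags = sorted({t for sent in tagged_sentences for _, t in sent})
    vocab = {w for sent in tagged_sentences for w, _ in sent}
    trans, emit = Counter(), Counter()
    prev_count, tag_count = Counter(), Counter()
    for sent in tagged_sentences:
        prev = START
        for word, tag in sent:
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot shared by all unseen words

    def log_trans(prev, tag):
        # Additive smoothing: add delta to every count so no probability is zero.
        return math.log((trans[(prev, tag)] + delta)
                        / (prev_count[prev] + delta * n_tags))

    def log_emit(tag, word):
        return math.log((emit[(tag, word)] + delta)
                        / (tag_count[tag] + delta * n_words))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for words under the HMM."""
    if not words:
        return []
    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans(p, t))
            scores[t] = best[-1][prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy corpus (hypothetical data, not taken from the thesis).
corpus = [
    [("dog", "N"), ("runs", "V")],
    [("cat", "N"), ("sleeps", "V")],
    [("dog", "N"), ("sleeps", "V")],
]
tags, log_trans, log_emit = train_hmm(corpus)
print(viterbi(["dog", "runs"], tags, log_trans, log_emit))  # ['N', 'V']
```

Working in log space keeps long sentences from underflowing. With `delta` set to 1 this is add-one smoothing, and words never seen in training still receive a small nonzero emission probability instead of zeroing out every path.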

Keywords/Search Tags:

Tibetan, corpus, dictionary, part of speech tagging, Hidden Markov Model

Related items

1	Research On Automatic Notation Of Word For Tibetan Corpus Based On HMM
2	Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On Pre-trained Language Models
3	Tibetan Segmentation And POS Tagging Study
4	Research And Implementation Of The Tibetan Part Of Speech Tagging System
5	The Research On Tibetan Speech Recognition Technology
6	Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On GNN
7	A Study And Analysis On The Sixth Edition Of Modern Chinese Dictionary’s Part Of Speech Tagging
8	A Contrastive Study On The Part Of Speech Tagging Of Dictionary Of Contemporary Chinese And Grammatical Knowledge-base Dictionary
9	Research On The Speech Synthesis Technology Of Tibetan Dialect
10	Text Analysis Of Speech Synthesis Based On Statistical Parameters Of Tibetan Language In Specific Fields