Tibetan Segmentation And POS Tagging Study

Posted on: 2015-02-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C J Kang
GTID: 1265330431466227
Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
Tibetan information processing technology has been under development for more than twenty years. Great achievements have been made in both research and application development, and the technology has gradually reached the level of language information processing. Although Tibetan information processing follows English and Chinese technically, research-oriented Tibetan corpora for information processing are relatively scarce. Almost all open corpora are untagged and of limited value. Ontological research on Tibetan is not yet deep enough, so many properties valuable for Tibetan information processing cannot be mined and described, and the application development and scope of the technology remain limited. To address these problems, we adopt several statistical models and methods to study Tibetan word segmentation and part-of-speech tagging. Our achievements are as follows:

First, we put forward a Tibetan word segmentation method based on word position, which was among the first, in Tibetan word segmentation research at home or abroad, to take full advantage of Tibetan abbreviated forms.

We adopted a statistical method based on word position, which turns Tibetan word segmentation into a sequence labeling task, and built a Tibetan word segmentation system. The system is based on a conditional random field and extends the 4-tag set used in Chinese word segmentation to a 6-tag set according to the grammatical features of Tibetan abbreviated forms, which makes it more suitable for Tibetan word segmentation. We trained the conditional random field model on a manually proofread corpus of more than 1 million syllable characters. The large-scale corpus experiment showed that the F value of the system reached 91% in the open test, which is satisfactory.
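To illustrate how word-position tagging turns segmentation into sequence labeling, here is a minimal sketch. It uses the common 4-tag set (B, M, E, S), assumed here to be the Chinese 4-tag set the abstract refers to. The two extra tags of the 6-tag set for abbreviated forms are not defined in this abstract, so they are not shown. The romanized syllables and the function names are hypothetical placeholders, not data or code from the dissertation.

```python
def words_to_tags(words):
    """Map a segmented sentence to (syllable, position tag) pairs.

    words: list of words, each word a list of syllables.
    Tags: B = word-initial, M = word-internal, E = word-final, S = single-syllable word.
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word[0], "S"))
        else:
            tagged.append((word[0], "B"))
            for syllable in word[1:-1]:
                tagged.append((syllable, "M"))
            tagged.append((word[-1], "E"))
    return tagged


def tags_to_words(tagged):
    """Rebuild the word list from (syllable, tag) pairs; a word ends at E or S."""
    words, current = [], []
    for syllable, tag in tagged:
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # keep a trailing unfinished word instead of dropping it
        words.append(current)
    return words


# Hypothetical romanized syllables, used only as placeholders.
sentence = [["bod", "skad"], ["kyi"], ["tshig", "gcod", "pa"]]
tagged = words_to_tags(sentence)
tags = [tag for _, tag in tagged]
restored = tags_to_words(tagged)
```

A sequence labeler such as a conditional random field would be trained to predict the tag column from the syllable column; decoding the predicted tags with `tags_to_words` then yields the segmentation.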
In further research, we found that precision was limited by the recognition of Tibetan abbreviated forms. Given the complexity of Tibetan abbreviated forms, we summarized previous research results and introduced a rule-based post-processing module. In the final experiment, the F value in the open test exceeded 95%, which means the system can meet the practical demands of Tibetan corpus construction.

Second, we study the features of Tibetan names and discuss a recognition method that builds on the Tibetan word segmentation research.

Through research on Tibetan names, we summarized several strategies for Tibetan name recognition and chose a statistical approach. The approach is again based on a conditional random field and uses boundary, prefix and suffix, and context features of Tibetan names. The experiment showed that the F value of the approach reached 91.26% in the open test. Regrettably, we did not solve the problem of distinguishing Tibetan names from general words that share the same form, which hurts recognition performance. However, by adjusting the tag set and optimizing the feature templates, we should be able to improve the performance of Tibetan name recognition.

Third, we used a combination of several statistical models to study Tibetan part-of-speech tagging. For the first time, we combined the maximum entropy model with the conditional random field model to build a Tibetan part-of-speech tagging method.

Through research on Tibetan parts of speech, we first simplified the Tibetan part-of-speech tag set to a size usable by the statistical model, then used the max entropy model to construct a Tibetan part-of-speech tagging system and trained it on a small-scale corpus.
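The F values reported above for segmentation and name recognition can be read through the following sketch. It assumes the usual balanced F measure, F = 2PR / (P + R), computed over word spans; the abstract does not state the exact formula used, and the example data and function names are hypothetical.

```python
def word_spans(words):
    """Convert a word list (each word a list of syllables) to a set of (start, end) spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F) of a predicted segmentation against a gold one."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words with identical boundaries
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: the system wrongly splits the last word in two.
gold = [["bod", "skad"], ["kyi"], ["tshig", "gcod", "pa"]]
pred = [["bod", "skad"], ["kyi"], ["tshig", "gcod"], ["pa"]]
p, r, f = segmentation_prf(gold, pred)
```

In this toy case two of the four predicted words match the three gold words, so precision is 1/2, recall is 2/3, and F is 4/7.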
The experiment showed that the precision of the Tibetan part-of-speech tagging system based on the max entropy model reached 87.76%, which nearly meets the demands of lexical analysis.

Building on the max entropy model, we put forward an error correction model based on a conditional random field. The error correction model was trained on the outputs of the max entropy model so that it could pick the correct tagging result from the three highest-probability outputs and improve the precision of Tibetan part-of-speech tagging. The experiments showed that, with the same training and test corpus, the mixed model combining max entropy with a conditional random field reached 89.12% accuracy, close to the level of Chinese part-of-speech tagging systems.

Fourth, we built an integrated model of Tibetan word segmentation and part-of-speech tagging based on a conditional random field. To integrate Tibetan word segmentation and part-of-speech tagging into a unified system, we put forward a new approach to Tibetan lexical analysis.

We took full advantage of the deep dependencies between word segmentation and part-of-speech tagging, and used lexical information to resolve ambiguity during word segmentation. With a small-scale training corpus, the F value of the integrated model reached 89.0%, which showed that the model combines word position information with part-of-speech context well and can be more effective at improving word segmentation precision. The performance of the integrated model met the demands that corpus construction places on automatic word segmentation. Although the part-of-speech precision of the integrated model reached 85.35%, which is still behind the independent part-of-speech tagging model, we should be able to improve its part-of-speech performance by expanding the training corpus.
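One common way to cast joint segmentation and part-of-speech tagging as a single sequence labeling task is to attach the part-of-speech tag to each syllable's position tag. The sketch below illustrates that general idea only. The abstract does not specify the label scheme of the integrated model, so the composite labels, the part-of-speech tags (`n`, `p`, `v`), and the romanized syllables are all assumptions.

```python
def to_joint_labels(tagged_words):
    """Turn (syllables, pos) words into (syllable, "POSITION-POS") pairs."""
    labeled = []
    for syllables, pos in tagged_words:
        last = len(syllables) - 1
        for i, syllable in enumerate(syllables):
            if last == 0:
                position = "S"  # single-syllable word
            elif i == 0:
                position = "B"  # word-initial syllable
            elif i == last:
                position = "E"  # word-final syllable
            else:
                position = "M"  # word-internal syllable
            labeled.append((syllable, f"{position}-{pos}"))
    return labeled


def from_joint_labels(labeled):
    """Recover (syllables, pos) words from composite labels; a word ends at E or S."""
    words, current = [], []
    for syllable, label in labeled:
        position, pos = label.split("-", 1)
        current.append(syllable)
        if position in ("E", "S"):
            words.append((current, pos))
            current = []
    return words


# Hypothetical words with hypothetical part-of-speech tags.
analysis = [(["bod", "skad"], "n"), (["kyi"], "p"), (["tshig", "gcod", "pa"], "v")]
joint = to_joint_labels(analysis)
labels = [label for _, label in joint]
recovered = from_joint_labels(joint)
```

With labels of this kind, one decoding pass yields both the word boundaries and the part-of-speech tags, at the cost of a larger label set, which may be one reason a small training corpus limits tagging precision.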
Keywords/Search Tags: Tibetan abbreviated forms, Tibetan word segmentation, conditional random field, Tibetan name recognition, Tibetan part-of-speech, max entropy model, integration of word segmentation and part-of-speech