Research On Key Technologies In Thai Lexical Analysis

Posted on: 2017-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Zhao
GTID: 2358330488450198
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Lexical analysis is the basis of Thai language information processing, and its quality directly affects subsequent Thai processing as well as cross-language research and applications. Thai words are composed of multiple characters and have no explicit delimiter between them; spaces mark sentence boundaries rather than word boundaries. These features make Thai lexical analysis considerably more difficult. Taking the features of Thai into account, the dissertation addresses four problems in Thai lexical analysis: syllable segmentation, word segmentation, part-of-speech (POS) tagging, and Named Entity Recognition (NER). The main results are as follows.

(1) Thai syllables consist of consonant characters, vowel characters, and tone characters. Based on this structure, the dissertation implements Thai syllable segmentation with Conditional Random Fields (CRFs). The method defines features from the category and position of each Thai letter and uses CRFs to label the character sequence. Experimental results show that the method makes effective use of character position and character category information and achieves good syllable segmentation results.

(2) Thai syllables consist of multiple characters, and Thai words are in turn composed of syllables. To exploit both levels of structure, the dissertation presents a Thai word segmentation method based on dual-layer CRFs that uses both character and syllable features. First, based on an analysis of Thai word formation and syllable structure, the first-layer CRF model segments the text into syllables at the character level. Then, according to the way words are constructed, the second-layer CRF model labels words based on the output of the first layer. By merging character and syllable features and using the syllable as a bridge, the method reduces the influence of character sequence labeling errors on the word segmentation results. In addition, by using combined syllable features, it can draw on coarser-grained syllable context when composing words. Experimental results show that the dual-layer CRFs use context information effectively and are more conducive to word sense disambiguation. Compared with the CRF-based Thai word segmentation model, the dual-layer model uses fewer features, improves word segmentation accuracy, and decodes faster.

(3) The dissertation implements Thai POS tagging with Hidden Markov Models (HMMs) and with CRFs, and proposes a neural network POS tagging method that incorporates Thai word vectors. Experimental results show that the neural network method achieves higher accuracy than the HMM and CRF methods.

(4) The dissertation implements Thai Named Entity Recognition with CRFs and also proposes a neural network method for Thai NER. Because labeled corpora for Thai NER are scarce, the neural network cannot obtain enough information from them, and entity words are not fully trained when 0/1-coded (one-hot) word vectors are embedded and trained inside the NER network. The word vectors and the NER neural network are therefore trained separately: the word vectors are trained on their own from a large amount of unlabeled data, and the trained word vectors are then used as the input to the neural network NER model, which improves the Thai NER results.
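As a rough illustration of points (1) and (2), the following Python sketch shows how character category and position features might be built for a sequence labeler, and how begin/inside labels can be merged first into syllables and then into words. It is a sketch under stated assumptions and not the dissertation's implementation: the coarse category set (`C`, `V`, `T`, `O`), the `B`/`I` label scheme, the window size, and the function names `char_category`, `char_features`, and `merge_by_labels` are all hypothetical, and the labels are written by hand where a trained CRF would normally supply them.

```python
def char_category(ch):
    """Coarse Thai character category from the Unicode code point (assumed scheme)."""
    cp = ord(ch)
    if 0x0E01 <= cp <= 0x0E2E:                       # consonants
        return "C"
    if 0x0E30 <= cp <= 0x0E39 or 0x0E40 <= cp <= 0x0E44:  # vowel signs and leading vowels
        return "V"
    if 0x0E48 <= cp <= 0x0E4B:                       # tone marks
        return "T"
    return "O"                                       # anything else


def char_features(text, i, window=1):
    """Character and category features in a small window around position i."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        inside = 0 <= j < len(text)
        feats[f"char[{off}]"] = text[j] if inside else "<PAD>"
        feats[f"cat[{off}]"] = char_category(text[j]) if inside else "<PAD>"
    return feats


def merge_by_labels(units, labels):
    """Join units into larger units: 'B' starts a new one, any other label continues it."""
    if len(units) != len(labels):
        raise ValueError("units and labels must have the same length")
    merged = []
    for unit, label in zip(units, labels):
        if label == "B" or not merged:
            merged.append(unit)
        else:
            merged[-1] += unit
    return merged


text = "ภาษาไทย"

# Layer 1 input: one feature dict per character (what a CRF would consume).
features = [char_features(text, i) for i in range(len(text))]

# Layer 1 output (hand-written here): character labels -> syllables.
syllables = merge_by_labels(list(text), list("BIBIBII"))

# Layer 2 output (hand-written here): syllable labels -> words.
words = merge_by_labels(syllables, ["B", "I", "B"])

print(syllables)
print(words)
```

Because the second layer sees whole syllables and not single characters, each of its labels covers a larger span of context, which is the bridging role of the syllable described in (2).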
Keywords/Search Tags: Lexical Analysis, Thai Language Processing, Conditional Random Fields, Word Vector, Neural Network