Research On Thai Word Segmentation And Part-of-speech Tagging Based On Multi-granularity Feature

Posted on:2023-03-04

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Wang

GTID:2555306797482604

Subject:Software engineering

Abstract/Summary:

Thai word segmentation and part-of-speech tagging are basic tasks in Thai natural language processing and necessary preprocessing steps for many downstream tasks. They strongly affect Thai text classification, information retrieval, and machine translation, so research on them has clear practical value. In English, spaces mark word boundaries. In Thai, there are no spaces between words; a space is used only as a separator between sentences, which makes word segmentation essential. In part-of-speech tagging, there is a dependency between the syllables that form a word and the word's part of speech, but existing models do not extract this dependency sufficiently. Many tools for Chinese and English word segmentation and part-of-speech tagging exist in China and abroad. However, because Thai differs from both Chinese and English, existing Chinese word segmentation and part-of-speech tagging models cannot be applied directly to Thai.

This thesis focuses on word segmentation and part-of-speech tagging methods for Thai lexical analysis. It integrates multi-granularity features so that the feature information is complementary, and it exploits the dependencies between features to improve the performance of Thai word segmentation and lexical analysis. The thesis achieves the following research results:

(1) A joint syllable and word segmentation method based on local multi-head attention that integrates characters and character categories is proposed. Since each Thai word is composed of one or more orthographic syllables, word segmentation usually requires pipeline steps to use syllable information. This thesis therefore proposes Joint Cut, a neural network model that uses syllable information under a multi-task paradigm. The model uses a local multi-head attention mechanism to capture the contextual features of each character and character category, and then uses separate neural classifiers to predict word labels and syllable labels at the same time. The experimental results show that the method achieves good Thai word segmentation in both accuracy and speed. Its segmentation speed is at least 4 times faster than that of the most advanced Thai word segmentation model.

(2) A local multi-head attention part-of-speech tagging method that integrates word-syllable pair features is proposed. As the basic constituent unit of Thai words, the syllable is related to a word's part of speech and is therefore a useful feature for part-of-speech tagging. Based on the local dependence and word-formation characteristics of this kind of language, a part-of-speech tagging model is proposed that uses a local multi-head attention mechanism to learn contextual features from the sequence of local word-syllable pairs, and uses a conditional random field to model the dependencies between part-of-speech labels. Experiments on Thai part-of-speech tagging data sets show that, compared with the current best model, the macro-averaged F1 score improves by 1%, and tagging of low-frequency parts of speech and unknown words improves significantly.

(3) A cascading network method for Thai word segmentation and part-of-speech tagging that integrates words and syllables is proposed. At present, part-of-speech tagging usually relies on pipeline steps to use word segmentation information, which makes it vulnerable to the propagation of word segmentation errors. A cascading network model that integrates multi-granularity features for Thai word segmentation and part-of-speech tagging is therefore proposed. The model formalizes both tasks as a sequence tagging problem. It uses a multi-head attention mechanism and a conditional random field architecture to predict the word boundary labels B and I from the input syllables, which yields a word segmentation sequence, and then predicts the part-of-speech label of each word based on the word segmentation sequence and the word sequence. The experimental results show that the joint model is more accurate than step-by-step word segmentation followed by part-of-speech tagging, that it improves part-of-speech prediction for ambiguous words, and that its accuracy exceeds 96% on the public data set.

(4) Based on the above research results, a prototype system for Thai text lexical analysis is designed and implemented. It provides Thai word segmentation and part-of-speech tagging. The user enters the Thai sentences to be processed into the analysis text box; the system first splits the input text into sentences and then feeds the processed sentences to the model to obtain the segmentation and part-of-speech tagging results, which provides users with a visual Thai text processing service.
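The B and I word boundary labels mentioned in result (3) can be illustrated with a short sketch. It assumes that B marks the first syllable of a word and I marks a syllable that continues the current word, which is the usual reading of such labels; the abstract does not spell this out. The function name `merge_syllables` and the example syllable split are hypothetical and are not taken from the thesis, and the sketch covers only the decoding step, not the attention or conditional random field model that predicts the labels.

```python
from typing import List


def merge_syllables(syllables: List[str], labels: List[str]) -> List[str]:
    """Join syllables into words using B/I boundary labels.

    Assumption: "B" starts a new word, "I" continues the current word.
    """
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")

    words: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label not in ("B", "I"):
            raise ValueError(f"unexpected label: {label!r}")
        # A leading "I" has no word to attach to, so it opens a new word.
        if label == "B" or not words:
            words.append(syllable)
        else:
            words[-1] += syllable
    return words


# Illustrative input: a hand-made syllable split with hand-made labels.
example_syllables = ["ประ", "เทศ", "ไทย", "สวย"]
example_labels = ["B", "I", "I", "B"]
print(merge_syllables(example_syllables, example_labels))
# → ['ประเทศไทย', 'สวย']
```

Because the words are rebuilt by concatenating syllables, joining the output words reproduces the original text, so a mispredicted label moves a word boundary but never drops or alters characters.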

Keywords/Search Tags:

Word Segmentation, Part-of-Speech Tagging, Syllable, Local Multi-head Attention, Conditional Random Field

Related items

1	Entity Recognition And Part - Of - Speech Tagging Of Ancient Chinese Chronology
2	Tibetan Segmentation And POS Tagging Study
3	Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On Pre-trained Language Models
4	Research On Automatic Word Segmentation Of Zuo Zhuan Based On Conditional Random Field
5	Research On Tibetan Word Segmentation And Part-of-speech Tagging Based On GNN
6	Research On The Methods Of Ancient Chinese Word Segmentation And Part-of-speech Tagging
7	Experimental Study On The Fusion Of Dictionary Segmentation And Model Word Segmentation In Chinese
8	The Research On The Word Formation And The Usage Of Morphemes With Multiple Part-of-speech Tagging Monosyllabic Compound-word
9	Research And Implementation Of The Tibetan Part Of Speech Tagging System
10	Research On Word Segmentation And Part-of-speech Of Tibetan On Neural Network