Research On Chinese Word Segmentation Method Based On Active Learning

Posted on: 2020-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Q He
GTID: 2438330620955604
Subject: Software engineering
Abstract/Summary:
Chinese word segmentation is a foundational task in Chinese information processing, and its output directly affects every downstream stage of the applications built on it. Segmentation systems based on supervised learning are now widely used in industry and achieve good results. However, supervised learning relies on large amounts of manually annotated data. For text in specialized domains, such systems are impractical, because annotated data in these domains is scarce and manually annotating a large volume of text is very costly. How to maintain model performance with only a small amount of annotated data has been studied extensively, and active learning is an effective and practical solution. This thesis applies active learning to Chinese word segmentation, improves the model training process within the active learning framework, and implements a semi-automatic annotation system based on this work to improve on the traditional annotation workflow. The main research work of this thesis is as follows:

(1) Chinese word segmentation based on active learning. This method uses conditional entropy to measure the uncertainty of each sample and recommends the samples with the highest uncertainty, treated as the most valuable data, for manual annotation. In this way, a high-performance Chinese word segmenter can be trained on a small labeled data set.

(2) Training the segmenter with semi-supervised learning. The method in work (1) trains the model on only a very small amount of informative data and makes no use of the large amount of unlabeled data. The EM algorithm is therefore used to train the segmenter on both labeled and unlabeled data, which improves both its performance and its generalization ability. However, the examples selected by active learning may not be the ones closest to the decision surface, so the decision surface cannot be found quickly and training the segmenter takes too long, which makes the approach less practical. This motivates work (3).

(3) Active learning for Chinese word segmentation based on generative examples from adversarial classifier reverse engineering. This method uses a generator, trained through adversarial learning, to produce pseudo-examples that are more valuable than real data. These examples lie closest to the decision surface, so a more accurate decision surface can be obtained from them and segmentation performance can be improved. However, the class labels of the pseudo-examples are assigned by the segmenter itself, so their reliability depends on the segmenter's performance. As a result, the generated pseudo-examples contain many unreliable instances, which can easily degrade segmentation quality. How to improve the reliability of the generated examples needs further study.
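
The entropy-based selection step described in work (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a B/M/E/S tagging scheme, assumes the segmenter exposes a per-character probability distribution over those tags, and scores each sentence by its mean per-character entropy as a simple stand-in for the conditional entropy of the full tag sequence. The names `token_entropy`, `sentence_uncertainty` and `select_for_annotation`, and the toy pool, are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# One probability distribution per character over the tags B, M, E, S
# (an assumed tagging scheme; the thesis abstract does not specify one).
TagDistribution = Sequence[float]


def token_entropy(dist: TagDistribution) -> float:
    """Shannon entropy (in nats) of one character's tag distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def sentence_uncertainty(marginals: Sequence[TagDistribution]) -> float:
    """Mean per-character entropy, so that long sentences are not favoured."""
    if not marginals:
        return 0.0
    return sum(token_entropy(d) for d in marginals) / len(marginals)


def select_for_annotation(
    pool: Dict[str, Sequence[TagDistribution]], k: int
) -> List[str]:
    """Return the ids of the k sentences with the most uncertain predictions."""
    ranked = sorted(pool, key=lambda s: sentence_uncertainty(pool[s]), reverse=True)
    return ranked[:k]


# Toy unlabeled pool: sentence id -> per-character tag distributions
# (made-up numbers, for illustration only).
pool = {
    "confident": [[0.97, 0.01, 0.01, 0.01], [0.01, 0.01, 0.97, 0.01]],
    "ambiguous": [[0.25, 0.25, 0.25, 0.25], [0.40, 0.10, 0.40, 0.10]],
    "mixed": [[0.70, 0.10, 0.10, 0.10], [0.05, 0.05, 0.85, 0.05]],
}

# The two sentences the annotator would be asked to label first.
selected = select_for_annotation(pool, k=2)
print(selected)
```

In an active learning loop, the selected sentences would be annotated manually, added to the labeled set, and the segmenter retrained before the next round of selection. Averaging over characters is one possible design choice; summing the entropies instead would favour longer sentences.
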
Keywords/Search Tags: Active Learning, Chinese Word Segmentation, Semi-automatic Labeling, Generative Examples