Experimental Study On The Fusion Of Dictionary Segmentation And Model Word Segmentation In Chinese

Posted on:2020-12-18

Degree:Master

Type:Thesis

Country:China

Candidate:T T Fang

GTID:2405330596474389

Subject:Applied statistics

Abstract/Summary:

Chinese word segmentation is the process of dividing a continuous sequence of characters into separate, recognizable words according to certain rules. Because Chinese text is written as a string of characters with no delimiter between words, word segmentation is the first step in Chinese natural language processing, and it is a basic step that cannot be avoided. Whether the segmentation is good directly determines whether the final result is satisfactory. As knowledge is updated and iterated ever faster, segmentation methods face higher demands for flexibility and rigor. The popularity of the Internet has produced a large number of new words. These new words reflect the development of the Internet, and they also pose new challenges for dictionary word segmentation. How to handle these new words quickly and efficiently is a key research question for dictionary-based segmentation.

At present, word segmentation systems mainly rely on either dictionary segmentation or machine learning segmentation. Dictionary segmentation is controllable and fast, but it cannot correctly segment unregistered (out-of-vocabulary) words. The machine learning model CRF (conditional random field) can handle unregistered words, but training a CRF requires many manually designed features, and verifying the validity of each feature takes a lot of time. Deep-learning-based algorithms for natural language processing have since emerged, but their controllability is not as good as that of dictionary segmentation. When a new word pattern is segmented incorrectly, dictionary segmentation can fix the problem quickly by adding the new words to the dictionary, whereas a model may need a large amount of additional relevant training corpus, which is often difficult or very costly to obtain.

To improve automatic Chinese word segmentation under these constraints, this thesis implements the dictionary segmentation module with the MMseg algorithm and uses BI-LSTM+CRF as the model segmentation module. Combining the two retains the controllability of dictionary segmentation while using model segmentation to handle the unregistered words that dictionary segmentation misses. The experiments are run on the Bakeoff corpus from SIGHAN’s Chinese language processing evaluation. The MMseg dictionary segmentation module is implemented first, the parameters of the model segmentation algorithm are then tuned, and finally the results of the two modules are combined. The experiments show that precision, recall, and F1 all improve after fusion, and that the fused system addresses both controllability and unregistered words.
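As a rough illustration of the dictionary side and of the reported metrics, the sketch below shows greedy forward maximum matching, the simplest matching rule that MMseg builds on, together with span-based precision, recall, and F1. It is not the thesis's implementation: the full MMseg algorithm adds further disambiguation rules that are omitted here, and the function names, the toy dictionary, and the `max_len` parameter are all assumptions made for the example.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching (simplified; not the full MMseg rule set)."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """A predicted word counts as correct only if both of its boundaries match gold."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy dictionary and sentence, chosen to show a greedy-matching error.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
predicted = forward_max_match("研究生命起源", toy_dictionary)
gold = ["研究", "生命", "起源"]
print(predicted)                             # ['研究生', '命', '起源']
print(precision_recall_f1(gold, predicted))  # each value is 1/3
```

In this toy case the greedy matcher prefers the longer entry 研究生 and mis-segments the sentence, so only one of three predicted words matches the gold segmentation. Adding or removing a dictionary entry changes the output immediately, which is the kind of controllability the abstract attributes to dictionary segmentation.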

Keywords/Search Tags:

dictionary segmentation, Chinese word segmentation, conditional random field, natural language processing

Related items

1	Research On Automatic Word Segmentation Of Zuo Zhuan Based On Conditional Random Field
2	Research On Thai Word Segmentation And Part-of-speech Tagging Based On Multi-granularity Feature
3	Design And Implement Of Parser Based On Grammar Function And Collocation
4	Research On The Integrated Processing Technology Of Sentence Segmentation And Lexical Analysis Of Ancient Texts Based On Deep Learning
5	Research And Implementation Of Teaching Chinese As Foreign Language System Based On Chatbot
6	The Study Of Automatic Chinese Phoneticize Label Based On Automatic Word Segmentation
7	Tibetan Segmentation And POS Tagging Study
8	Research On Automatic Texts Segmentation And Word Segmentation For Ancient Chinese Texts
9	Study On CFL Learners’ Word Segmentation Mechanism In Chinese Reading: Evidence From Eye Movements
10	The Research On Tibetan Automatic Word Segmentation Technology