Optimization And Implementation Of Chinese Lexical Analysis Algorithm For Chat Robot

Posted on:2021-04-26

Degree:Master

Type:Thesis

Country:China

Candidate:X X Dou

Full Text:PDF

GTID:2518306104488514

Subject:Computer technology

Abstract/Summary:

Chinese word segmentation has long been regarded as the first step of Chinese information processing. Named entities are often the components of a sentence that matter most, and the output of the word segmentation task serves as the input of the named entity recognition task. Optimizing the relevant algorithms therefore improves the speed and accuracy of both word segmentation and named entity prediction, that is, of lexical analysis as a whole, which in turn improves the performance of the entire natural language processing task and helps computers understand Chinese better. This gives the work considerable research significance. Popular open-source word segmentation tools such as Jieba, Pangu, and Ansj reach a final segmentation accuracy of only about 80%, which leaves a lot of room for improvement.

Building on the perceptron model, the averaged perceptron is optimized with the gradient descent method, and the optimized algorithm is further modified so that it can be trained with multiple threads. Both the accuracy and the speed of Chinese word segmentation prediction improve as a result. Because the training corpus is obtained by web crawling, a general-purpose web crawler is first implemented with the Scrapy framework, and more than 2.5 million question-and-answer pairs are collected. Data quality has a strong effect on machine learning results, and crawled data often contain a large number of web page tags, so the collected data must be cleaned. Stop-word filtering is the most important part of this process, so a dictionary tree (trie) data structure for Chinese word matching is designed, and the KMP matching algorithm is optimized to obtain high-quality data quickly.

The entities that people pay most attention to can be counted as named entities, and in most cases the core of an information extraction task can also be identified as a named entity. Named entity recognition is therefore also a very important part of Chinese natural language processing. In this thesis, the quasi-Newton method is used to optimize the conditional random field model, which improves both the speed and the accuracy of named entity recognition.

Experiments show that the optimized word segmentation algorithm raises the accuracy of Chinese word segmentation prediction to nearly 96.7%, while the total training time falls from 128 seconds to 59 seconds. With the improved matching algorithm, the time complexity of stop-word filtering is reduced from O(n) to O(log n). In named entity recognition, numerical optimization avoids storing and computing the Hessian matrix, and the time complexity of the algorithm is reduced from O(n^2) to O(n × m), where m ≪ n. With the convex optimization of the conditional random field model by the quasi-Newton method, the accuracy of named entity recognition improves by 2.7% compared with the model before optimization. Finally, the trained model is wrapped in an interface that is called by a WeChat Mini Program, realizing a simple question answering system.
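The dictionary tree mentioned in the abstract for stop-word filtering can be pictured with a small sketch. The thesis does not publish its data structure or its optimized KMP variant, so everything below is an assumption made for illustration: the class name `StopWordTrie`, its methods, the character-level longest-match strategy, and the sample stop words are all hypothetical and are not the author's implementation.

```python
class StopWordTrie:
    """Illustrative dictionary tree (trie): each node maps one character to a child node."""

    _END = ""  # sentinel key marking that a stop word ends at this node

    def __init__(self, stop_words):
        self.root = {}
        for word in stop_words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[self._END] = True

    def longest_match(self, text, start):
        """Return the length of the longest stop word beginning at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self._END in node:
                best = i - start + 1
        return best

    def filter(self, text):
        """Remove every stop word occurrence, scanning left to right with longest match."""
        kept = []
        i = 0
        while i < len(text):
            length = self.longest_match(text, i)
            if length:
                i += length          # skip the matched stop word
            else:
                kept.append(text[i])  # keep characters that start no stop word
                i += 1
        return "".join(kept)


# Hypothetical stop-word list, chosen only for this example.
trie = StopWordTrie(["的", "了", "因为"])
print(trie.filter("因为下雨了"))  # prints 下雨
```

In this sketch each lookup walks at most as many nodes as the longest stop word has characters, regardless of how many stop words the dictionary holds, which is the usual reason to prefer a trie over testing each stop word in turn.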

Keywords/Search Tags:

Chinese word segmentation, averaged perceptron, stop-word filtering, named entity recognition, conditional random field

Related items

1	Research Of Named Entity Recognition Based On Conditional Random Fields
2	Research And Implement Of Chinese Word Segment Techniques Based On The Conditional Random Field
3	Research And Application Of The Chinese Organization Names Recognition And Disambiguation
4	Research On The Key Technology Of Named Entity Recognition And Relation Extraction In Military Field
5	Named Entity Recognition Based On Conditional Random Fields Chinese Research
6	Research On Chinese Named Entity Recognition Technology Based On Neural Networks
7	Application Research On Chinese Named Entity Recognition Based On Domain Ontology
8	Study On The Tibetan Word Segmentation And Named Entity Recognition With Conditional Random Fields
9	Research On Chinese Named Entity Recognition And Field Application In Inspection And Quarantine
10	A Study On Cambodian Word Method Based On Conditional Random Field