| Chinese word segmentation has long been regarded as the first step of Chinese information processing. Named entities are often the components of a sentence that matter most, and the output of word segmentation serves as the input to named entity recognition. Optimizing the relevant algorithms therefore improves the speed and accuracy of both word segmentation and named entity prediction, that is, of lexical analysis as a whole, which in turn improves the performance of the entire natural language processing task and helps computers understand Chinese better. This gives the work considerable research significance. Popular open-source word segmentation tools today include Jieba, Pangu, and Ansj, but their final segmentation accuracy is only about 80%, which leaves a lot of room for improvement. Building on the perceptron model, this paper optimizes the averaged perceptron with gradient descent and modifies the training procedure so that multi-threaded training can be used, improving both the accuracy and the speed of Chinese word segmentation prediction. Because the training corpus is collected by web crawlers, a general-purpose crawler is first implemented with the Scrapy framework, yielding more than 2.5 million question-and-answer pairs. Data quality has a strong effect on machine learning results, and crawled data often contains a large number of web page tags, so the collected data must be cleaned. Stop-word filtering is the most important part of this cleaning process, so a dictionary tree (trie) data structure for Chinese word matching is designed and the KMP matching algorithm is optimized to obtain high-quality data quickly. The entities that people care about most can generally be treated as named entities, and in most cases the core target of an information extraction task is also a named entity. Named entity recognition is therefore another very important part of Chinese natural language processing. In this paper, a quasi-Newton method is used to optimize the conditional random field model, which improves both the speed and the accuracy of named entity recognition. Experiments show that the optimized word segmentation algorithm raises the accuracy of Chinese word segmentation prediction to nearly 96.7%, while the total training time falls from 128 seconds to 59 seconds. With the improved matching algorithm, the time complexity of stop-word filtering is reduced from O(n) to O(log n). In named entity recognition, numerical optimization avoids storing and computing the Hessian matrix, reducing the time complexity of the algorithm from O(n^2) to O(nm), where m ≪ n. With the convex optimization of the conditional random field model by the quasi-Newton method, the accuracy of named entity recognition improves by 2.7% over the unoptimized model. Finally, the trained model is wrapped in an interface that is called by a WeChat Mini Program, realizing a simple question answering system. |
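
The abstract mentions a dictionary tree (trie) for matching Chinese stop words during data cleaning but gives no details. As a rough illustration only, the sketch below shows one way a character-level trie can strip stop words and leftover tags from text with longest-match lookup. The names `TrieNode`, `StopWordTrie`, `longest_match`, and `filter` are hypothetical, the stop-word list is invented for the example, and the sketch does not reproduce the paper's optimized KMP matching or its stated complexity.

```python
class TrieNode:
    """One trie node: child links keyed by character, plus an end-of-word flag."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class StopWordTrie:
    """Hypothetical helper: a character-level trie over a stop-word list."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        # Walk or create one node per character, then mark the end of the word.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Length of the longest stop word beginning at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best

    def filter(self, text):
        """Return text with every longest-matching stop word removed."""
        kept = []
        i = 0
        while i < len(text):
            length = self.longest_match(text, i)
            if length:
                i += length          # skip the matched stop word
            else:
                kept.append(text[i])  # keep this character
                i += 1
        return "".join(kept)


# Illustrative stop-word list (an assumption, not taken from the paper).
trie = StopWordTrie(["的", "了", "<br>"])
cleaned = trie.filter("我的书<br>买了")
print(cleaned)
```

Each lookup costs time proportional to the length of the matched word, not to the size of the stop-word list, which is the usual reason to prefer a trie over scanning the list. In practice, stop-word filtering is often applied to segmented tokens, not raw characters, so that characters inside longer words are not removed by accident.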