Research On New Word Discovery And Entity Recognition Of Chinese Electronic Medical Records

Posted on: 2021-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: T Jiang
Full Text: PDF
GTID: 2404330614960450
Subject: Computer technology
Abstract/Summary:
New word discovery and named entity recognition are two important research topics in data mining. New word discovery recognizes out-of-vocabulary words and thereby improves the accuracy of Chinese word segmentation. Named entity recognition identifies named entities of various types and is one of the most important techniques for constructing a knowledge graph. Chinese electronic medical records are the professional records that medical staff keep of the entire course of a patient's consultation. Because the text contains a large amount of real clinical knowledge, it has attracted the attention of researchers. Using natural language processing to extract this knowledge fully would greatly advance medical informatization. The research work in this thesis is as follows:

(1) We propose an improved new word discovery method. The method first performs unsupervised pre-segmentation based on the N-gram model, and then uses word frequency, mutual information, and branch entropy as the main features for new word discovery. After obtaining candidate words, we use grid search to find the optimal combination of feature thresholds. On corpora from four different fields, we compare the improved method with a method that pre-segments the text using general-purpose tools. The experimental results show that the method adapts well across domains. On the electronic medical record corpus in particular, the accuracy of the top 10% of discovered new words reached 85.9%, significantly exceeding the comparison method.

(2) For named entity recognition in Chinese electronic medical records, we propose an improved method. It first uses the unsupervised new word discovery method to build a domain dictionary that improves the accuracy of Chinese word segmentation, and then uses the BI-LSTM-CRF method for named entity recognition. Experiments on the electronic medical record corpus show that the entity F1-Measure increased by 1.46% after adding the medical domain dictionary.

(3) To address the shortage of high-quality annotated text in the electronic medical record field, this thesis proposes a named entity recognition method that incorporates the BERT model. The method uses BERT to vectorize the text and uses BI-LSTM-CRF as the fine-tuning method for entity recognition. The experiments compare entity recognition results across different training methods, different fine-tuning methods, and with and without further pretraining of the language model. The results show that the best performance is obtained by using BERT as the language model with BI-LSTM-CRF as the fine-tuning method. The F1-Measure of entity recognition reaches 83.39%, and further pretraining the language model improves the F1-Measure by about 0.54%.
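The three features named in point (1), word frequency, mutual information, and branch entropy, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: it skips the N-gram pre-segmentation step and scores every character n-gram directly, the function names (`candidate_stats`, `min_pmi`, `discover`) are hypothetical, and the default thresholds are placeholders for values that the thesis selects by grid search.

```python
import math
from collections import Counter, defaultdict


def candidate_stats(text, max_len=4):
    """Count every character n-gram up to max_len, plus its left/right neighbours."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1
    return counts, left, right


def entropy(counter):
    """Shannon entropy (natural log) of a neighbour-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def min_pmi(gram, counts, total_chars):
    """Smallest pointwise mutual information over all two-way splits of gram."""
    p = counts[gram] / total_chars
    scores = []
    for i in range(1, len(gram)):
        p_left = counts[gram[:i]] / total_chars
        p_right = counts[gram[i:]] / total_chars
        scores.append(math.log(p / (p_left * p_right)))
    return min(scores)


def discover(text, min_freq=2, min_mi=1.0, min_entropy=0.5, max_len=4):
    """Keep n-grams that pass all three thresholds (placeholder values)."""
    counts, left, right = candidate_stats(text, max_len)
    total = len(text)
    found = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue
        mi = min_pmi(gram, counts, total)
        # Branch entropy: a real word should have varied neighbours on both sides.
        be = min(entropy(left[gram]), entropy(right[gram]))
        if mi >= min_mi and be >= min_entropy:
            found.append((gram, freq, mi, be))
    return sorted(found, key=lambda item: -item[1])
```

Calling `discover(corpus_text)` on a raw string returns `(candidate, frequency, mutual information, branch entropy)` tuples sorted by frequency. A grid search in the spirit of the abstract would loop over combinations of `min_freq`, `min_mi`, and `min_entropy` and keep the combination that scores best against a labeled word list.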
Keywords/Search Tags: Chinese electronic medical records, natural language processing, new word discovery, named entity recognition