With the continuous development of Internet technology, society is becoming increasingly digitized and information-driven, and the medical field is no exception. The proposal and development of digital healthcare and smart healthcare have accelerated medical informatization in China and generated large volumes of Chinese medical text. These texts contain a large number of medical named entities, which carry rich information about patients’ health conditions and treatments. Recognizing these entities quickly and accurately is a key step in advancing digital and smart healthcare. However, owing to the complexity of the Chinese language and the particular characteristics of Chinese medical text, recognition errors arise from incorrect entity boundaries and from "multiple meanings of words" (polysemy). In addition, annotated Chinese medical text is scarce, and insufficient annotated data leads to low recognition performance. Together, these problems seriously hinder the development of Chinese medical entity recognition.

This thesis relies mainly on deep learning and natural language processing techniques to study the Chinese medical entity recognition task in depth. It aims to solve three problems in current research: boundary recognition errors, "multiple meanings of words", and insufficient annotated medical text. The proposed models significantly improve recognition performance. The specific research content includes the following parts:

(1) To address the errors caused by incorrect boundary recognition in Chinese medical entity recognition, we propose a Chinese medical entity recognition model
based on character and word fusion. The model takes both character vectors and word vectors as input. Each is passed through its own BiLSTM network for feature extraction, the resulting character-based and word-based feature vectors are concatenated and combined in a fusion layer, and a CRF layer then produces the final label sequence. The experimental results show that the character and word fusion model outperforms models that use character vectors alone or word vectors alone. However, Chinese medical texts are particularly complex, and a word often has different meanings in different contexts, so this method does not solve the "multiple meanings of words" problem well.

(2) To address the entity recognition errors caused by "multiple meanings of words" in Chinese medical entity recognition, we propose a Chinese medical entity recognition model based on BIBC. This model adds an IDCNN layer to the classical BiLSTM-CRF so that it attends more closely to local information in the text, and it uses the pre-trained model BERT to better represent semantic information. The experimental results show that this model outperforms the other advanced models it is compared with on the Chinese medical entity recognition task.

(3) To address the poor recognition caused by insufficient labeled data in the two aforementioned models, we propose a Chinese medical entity recognition model based on semi-supervised learning. Semi-supervised learning is introduced to improve model performance by exploiting a large amount of unlabeled Chinese medical text. We use the Tri-Training algorithm and improve it in three respects: the division of the initial training subsets, the construction of the base classifiers, and the ensemble voting method. Experiments show that the addition and improvement of the
semi-supervised algorithm improves the performance of the Chinese medical entity recognition model compared with the supervised learning algorithm alone.
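
The agreement step at the core of standard Tri-Training can be sketched as follows. This is a minimal illustration only: it shows the textbook rule (an unlabeled sentence is pseudo-labeled for one base tagger when the other two agree on it) and not the improved subset division, base classifiers, or voting method proposed in this thesis. The names `pseudo_labels_for` and `make_lexicon_tagger`, the B/I/O tag set, and the toy lexicon taggers are all hypothetical stand-ins for trained sequence-labeling models.

```python
from typing import Callable, List, Sequence, Tuple

Tags = Tuple[str, ...]
Tagger = Callable[[str], Tags]  # maps a sentence to one tag per character


def pseudo_labels_for(
    target: int, taggers: Sequence[Tagger], unlabeled: Sequence[str]
) -> List[Tuple[str, Tags]]:
    """Collect sentences on which the two taggers other than `target` agree.

    In standard Tri-Training these (sentence, tags) pairs would be added to
    the training set of tagger `target` for the next round.
    """
    if len(taggers) != 3:
        raise ValueError("Tri-Training uses exactly three base taggers")
    first, second = (t for i, t in enumerate(taggers) if i != target)
    agreed = []
    for sentence in unlabeled:
        tags_a, tags_b = first(sentence), second(sentence)
        if tags_a == tags_b:  # whole-sequence agreement (an assumption here)
            agreed.append((sentence, tags_a))
    return agreed


def make_lexicon_tagger(lexicon: Sequence[str]) -> Tagger:
    """Hypothetical toy tagger: marks the first match of each lexicon term."""

    def tag(sentence: str) -> Tags:
        tags = ["O"] * len(sentence)
        for term in lexicon:
            start = sentence.find(term)
            if start != -1:
                tags[start] = "B"
                for k in range(start + 1, start + len(term)):
                    tags[k] = "I"
        return tuple(tags)

    return tag


# Three toy taggers; the second knows one extra term.
taggers = [
    make_lexicon_tagger(["头痛"]),
    make_lexicon_tagger(["头痛", "发热"]),
    make_lexicon_tagger(["头痛"]),
]
unlabeled = ["患者头痛", "患者发热"]

for_tagger_0 = pseudo_labels_for(0, taggers, unlabeled)
for_tagger_1 = pseudo_labels_for(1, taggers, unlabeled)
```

Here `for_tagger_0` contains only the first sentence, because the other two taggers disagree on the second, while `for_tagger_1` contains both. Requiring agreement on the whole tag sequence is one possible choice; agreement per entity or per token would be an equally plausible design.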