Research on named entity recognition (NER) in Chinese medical text is of great significance to medical information extraction. However, labeled data is difficult to obtain in the medical field, so the development of Chinese medical NER has been limited by the problem of low resources. Low resource means a lack of labeled data, which can seriously hurt the performance and generalization of a model. To cope with the lack of labeled data in low-resource scenarios, this paper proposes two Chinese medical named entity recognition methods. The main work is as follows.

(1) Incorporating a lexicon into self-training: a distantly supervised Chinese medical NER method (LSCNER) is proposed. First, a self-training-based high-recall entity method is proposed to effectively recall potential unlabeled entities. Second, a scoring and ranking method based on fine-grained dictionary enhancement is proposed to model the unique internal structure of medical entities; the recalled entities can then be screened, effectively reducing the false entities produced by the high-recall step. In addition, this paper constructs a Chinese medical NER dataset, CDD. The experimental results show that on the CDD dataset constructed in this paper and the public dataset CCKS 2019, LSCNER improves F1 by 3.20% and 5.03%, respectively, compared with the baseline model.

(2) Enhancing both text and labels: a Chinese medical NER method (TLCNER) is proposed. The method uses pre-trained language models and semi-supervised learning to optimize along both the text and label dimensions. First, a text-enhanced Chinese medical NER method based on a pre-trained language model is proposed. This paper collects 200,000 medical texts from the Internet and continues pre-training a public pre-trained model for medical-domain adaptation; the texts of the two public datasets are augmented and used for further pre-training for task adaptation. Second, a semi-supervised label-enhanced Chinese medical NER method is proposed. The semi-supervised learning method processes unlabeled data to obtain pseudo-labeled data, which is added to the original training data to improve labeling diversity. Finally, on two low-resource public datasets, TLCNER improves F1 by 2.68% and 3.66%, respectively, compared to the BERT-base model.
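The LSCNER pipeline described above (high-recall candidate generation followed by dictionary-based scoring and screening) can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the over-generating candidate step, the character-overlap scoring function, the tiny lexicon, and the 0.8 threshold are all assumptions chosen to make the filtering pattern concrete.

```python
# Toy sketch of the LSCNER idea: recall many candidate entities, then use a
# fine-grained lexicon score to screen out false entities. All names, scores,
# and the three-entry dictionary below are illustrative assumptions.

MEDICAL_DICT = {"糖尿病", "胰岛素", "高血压"}  # tiny stand-in medical lexicon

def recall_candidates(sentence):
    """High-recall step: propose every substring of length 2..4 as a
    candidate entity (a deliberately over-generating stand-in for a
    self-training teacher's low-threshold predictions)."""
    cands = set()
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + 5, len(sentence) + 1)):
            cands.add(sentence[i:j])
    return cands

def dict_score(span):
    """Fine-grained dictionary score: 1.0 for an exact lexicon hit,
    otherwise the best character-overlap ratio with any lexicon entry."""
    if span in MEDICAL_DICT:
        return 1.0
    return max((len(set(span) & set(w)) / len(set(w)) for w in MEDICAL_DICT),
               default=0.0)

def screen(cands, threshold=0.8):
    """Scoring-and-ranking filter: keep only high-scoring candidates,
    reducing the false entities introduced by the high-recall step."""
    ranked = sorted(cands, key=dict_score, reverse=True)
    return [c for c in ranked if dict_score(c) >= threshold]

sentence = "患者患有糖尿病并注射胰岛素"
kept = screen(recall_candidates(sentence))
```

In a real system the recall step would come from a trained teacher model's low-confidence predictions and the score would combine model confidence with lexicon features, but the keep/discard structure is the same.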
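The semi-supervised label-enhancement step of TLCNER (pseudo-label unlabeled text, keep confident predictions, merge into the training set) can be sketched as below. The confidence function, the 0.9 threshold, and the toy model are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch of TLCNER's label enhancement: a trained model pseudo-labels
# unlabeled sentences; only confident predictions are added to the training
# data. The threshold and toy model here are illustrative assumptions.

def pseudo_label(model, unlabeled, threshold=0.9):
    """Return (sentence, tags) pairs whose mean per-token confidence clears
    the threshold; low-confidence predictions are discarded to limit noise
    in the augmented training set."""
    accepted = []
    for sent in unlabeled:
        tags, confidences = model(sent)  # BIO tags + per-token confidences
        if sum(confidences) / len(confidences) >= threshold:
            accepted.append((sent, tags))
    return accepted

def toy_model(sent):
    """Stand-in for a trained NER model: tags one known disease mention
    with high confidence, everything else with low confidence."""
    tags = ["O"] * len(sent)
    start = sent.find("糖尿病")
    conf = [0.95] * len(sent) if start >= 0 else [0.5] * len(sent)
    if start >= 0:
        tags[start] = "B-DIS"
        tags[start + 1:start + 3] = ["I-DIS", "I-DIS"]
    return tags, conf

train = [("患者血压正常", ["O"] * 6)]           # original labeled data
unlabeled = ["确诊为糖尿病", "今日天气晴朗"]      # raw unlabeled text
train += pseudo_label(toy_model, unlabeled)      # augmented training set
```

Only the confidently labeled sentence is merged; the low-confidence one is dropped, which is what keeps pseudo-labeling from flooding the training data with noisy labels.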