Research on Chinese medical named entity recognition is of great significance for building Chinese medical knowledge graphs and intelligent diagnosis systems, and for accelerating the digitization and intelligent transformation of the medical field. However, existing Chinese medical named entity recognition datasets suffer from sample imbalance, which seriously degrades model performance. For example, the CCKS2017 dataset contains 10,719 entities in the Body class but only 722 in the Disease class, and models perform well on the Body class but poorly on the Disease class. To address sample imbalance in Chinese medical datasets, this paper proposes two methods for Chinese medical named entity recognition. By enhancing the model's feature extraction capability, both methods achieve significant performance improvements on few-shot data. The main contributions are as follows:

(1) A named entity recognition method based on word-level segment information fusion is proposed. Existing segment information fusion methods extract the same segment information for every word, without considering the relevance between words and entities. To alleviate this issue, a word-level extraction method, WL-SIE, is proposed. It extracts the segment information that is most relevant to each word, based on the association between words and entities. Experimental results show that WL-SIE improves the overall performance of the model by compensating for the low performance caused by sample imbalance.

(2) A dual-module collaborative training method for named entity recognition is proposed. Existing Chinese medical named entity recognition datasets label only category tags and do not label boundary tags separately, so they cannot provide models with sufficient entity boundary information. To learn boundary information on medical datasets, this paper designs two modules that are trained collaboratively: an entity discrimination module, built with deep reinforcement learning, provides boundary information, and an entity category prediction module, built with supervised learning, predicts categories. Together they form the collaborative model DT for Chinese medical named entity recognition. Experimental results show that, by introducing boundary information, the DT model compensates for the performance deficiency caused by sample imbalance.

(3) Extensive experiments are conducted on two general datasets, CCKS2017 and CMeEE, to verify the effectiveness of the proposed WL-SIE method and the collaborative model DT. Experimental results show that, compared with the baseline model BERT-BiLSTM-CRF, the WL-SIE method increases the F1 value by 0.05 and 0.30, respectively, and the collaborative model DT increases the F1 value by 1.73% and 4.62%, respectively, which demonstrates that the proposed methods effectively alleviate the problems caused by sample imbalance. In addition, the WL-SIE method outperforms two existing segment information extraction methods and performs better on long entity recognition.
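
The imbalance effect described above is visible only when F1 is reported per entity class rather than as a single aggregate. The following is a minimal sketch of entity-level, per-class F1, assuming BIO-style tags such as `B-Body` and `I-Disease`; the tag scheme and the helper names `extract_entities` and `per_class_f1` are illustrative assumptions, not the evaluation code used in this paper. The span extraction step also shows how entity boundaries are only implicit in category tags.

```python
from collections import defaultdict


def extract_entities(tags):
    """Return the set of (start, end, category) spans in a BIO tag sequence.

    `end` is exclusive. The BIO scheme is an assumption made for this sketch.
    """
    spans = set()
    start, category = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and category == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, category))
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
    return spans


def per_class_f1(gold_seqs, pred_seqs):
    """Compute exact-match entity-level F1 separately for each category."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold_tags, pred_tags in zip(gold_seqs, pred_seqs):
        gold = extract_entities(gold_tags)
        pred = extract_entities(pred_tags)
        for span in pred:
            counts[span[2]]["tp" if span in gold else "fp"] += 1
        for span in gold - pred:
            counts[span[2]]["fn"] += 1

    scores = {}
    for category, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        precision = c["tp"] / predicted if predicted else 0.0
        recall = c["tp"] / actual if actual else 0.0
        total = precision + recall
        scores[category] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy example: the Body entity is predicted exactly, while the Disease
# entity is predicted with a wrong right boundary and so counts as an error.
gold = [["B-Body", "I-Body", "O", "B-Disease", "I-Disease"]]
pred = [["B-Body", "I-Body", "O", "B-Disease", "O"]]
scores = per_class_f1(gold, pred)
```

In this toy case `scores` maps `Body` to 1.0 and `Disease` to 0.0, so a boundary error on a rare class removes that class's score entirely while barely moving an aggregate dominated by the frequent class.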