
Research On Biomedical Named Entity Recognition Based On Weak Supervision

Posted on: 2022-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z Liu
GTID: 2480306509984619
Subject: Computer Science and Technology
Abstract/Summary:
Biomedical named entity recognition is the first step of biomedical information extraction. Recently, deep neural networks have been successfully applied to biomedical named entity recognition. However, deep neural network models require large-scale, high-quality annotated datasets to estimate their parameters reliably. Manually annotated datasets are small in scale, which makes it difficult to train high-performance deep neural network models on them. Therefore, automatically constructing large-scale, high-quality weakly supervised datasets has become an effective way to improve the performance of biomedical named entity recognition.

(1) Construction of two-perspective weakly supervised datasets

This thesis proposes to automatically construct weakly supervised datasets using large-scale unlabeled literature and knowledge bases in the biomedical domain. First, two weakly supervised datasets are automatically constructed with PubTator and knowledge bases, one from the perspective of recall and one from the perspective of precision. Then, in order to recognize as many named entities as possible, a named entity recognition model is trained on the recall-perspective weakly supervised dataset. Finally, to improve precision, the model is further refined with curriculum learning and a masking operation on the precision-perspective weakly supervised dataset. Experiments on the CDR and NCBI disease corpora show that this approach achieves better performance than other weakly supervised approaches, which demonstrates the effectiveness of the automatic dataset construction approach proposed in this thesis and the complementary relationship between the two-perspective weakly supervised datasets.

(2) Biomedical entity recognition based on label re-correction

This thesis proposes a label re-correction approach that uses a human-annotated dataset to correct a weakly supervised dataset. First, a label correction model is trained on the weakly supervised dataset and the human-annotated dataset. Then, the label correction model is used to correct the noisy labels in the weakly supervised dataset, yielding a corrected weakly supervised dataset. Considering the large gap in quality between the weakly supervised dataset and the human-annotated dataset, the weakly supervised dataset is corrected iteratively to obtain a high-quality weakly supervised dataset. Finally, the two high-quality weakly supervised datasets, one per perspective, are used to train two named entity recognition models, which are then fused with knowledge distillation. Experiments on the CDR, NCBI disease and CHEMDNER corpora show that this approach achieves state-of-the-art performance. The results indicate that label re-correction can continuously improve the quality of weakly supervised datasets, and that knowledge distillation can effectively integrate the named entity recognition models from the two perspectives.

(3) Biomedical entity recognition based on pseudo parallel dataset correction

This thesis proposes building a pseudo parallel dataset from the human-annotated dataset and the weakly supervised dataset in order to correct a large number of noisy labels in a single pass. First, a knowledge acquisition model is trained on the weakly supervised dataset with curriculum learning and is then used to recognize the named entities in the training set of the human-annotated dataset, producing weak labels for that training set. The human-annotated labels and the weak labels of the training set are aligned to form a pseudo parallel dataset. Then, the pseudo parallel dataset is used to train a noise correction model, which corrects the noisy labels in the weakly supervised dataset to obtain high-quality weakly supervised datasets. Finally, label masking and Partial-CRF are each used to fuse the two-perspective weakly supervised datasets. Experiments on the CDR, NCBI disease and CHEMDNER corpora show that this approach outperforms the biomedical entity recognition approach based on label re-correction. The results indicate that pseudo parallel dataset correction can efficiently improve the quality of weakly supervised datasets, and that label masking and Partial-CRF can effectively integrate the weakly supervised datasets from the two perspectives.
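
The iterative label re-correction loop described in (2) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not specify the correction model, so a per-token weighted majority vote stands in for the neural tagger, and the names `train_token_model`, `predict`, `recorrect`, and the `human_weight` parameter are hypothetical. Only the overall shape follows the abstract: train on weak plus human-annotated data, relabel the weak data, and repeat.

```python
from collections import Counter, defaultdict


def train_token_model(weighted_datasets):
    """Toy stand-in for a neural label correction model (an assumption).

    Each dataset is a list of (tokens, labels) pairs and comes with a weight.
    The "model" is simply the highest-weighted label seen for each token.
    """
    counts = defaultdict(Counter)
    for sentences, weight in weighted_datasets:
        for tokens, labels in sentences:
            for token, label in zip(tokens, labels):
                counts[token.lower()][label] += weight
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def predict(model, tokens):
    """Label each token; tokens never seen in training default to 'O'."""
    return [model.get(token.lower(), "O") for token in tokens]


def recorrect(weak, human, rounds=3, human_weight=5):
    """Iteratively relabel the weak dataset.

    human_weight is a hypothetical knob that lets the small human-annotated
    set outvote noisy weak labels; the thesis does not describe such a weight.
    """
    for _ in range(rounds):
        model = train_token_model([(weak, 1), (human, human_weight)])
        corrected = [(tokens, predict(model, tokens)) for tokens, _ in weak]
        if corrected == weak:  # labels stopped changing, so stop early
            break
        weak = corrected
    return weak


if __name__ == "__main__":
    human = [(["aspirin", "treats", "headache"], ["B-Chemical", "O", "B-Disease"])]
    weak = [
        (["aspirin", "causes", "headache"], ["B-Chemical", "O", "O"]),
        (["severe", "headache"], ["O", "O"]),
    ]
    for tokens, labels in recorrect(weak, human):
        print(list(zip(tokens, labels)))
```

In this toy setting the missed "headache" mentions in the weak data are relabeled because the human-annotated example outweighs the noisy weak labels. A real system would replace the lookup table with a trained sequence tagger and would need a stopping criterion suited to that model.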
Keywords/Search Tags: Biomedical Named Entity Recognition, Weakly Supervised Dataset, Label Correction, Precision Perspective, Recall Perspective