| In recent years, with the rapid development of biomedical and computer technology, a huge amount of biomedical text data has been generated, and how to process and fully exploit these data has become an important issue. Biomedical named entity recognition is a key task in biomedical text processing and strongly affects downstream tasks such as entity relation extraction, question answering, and document classification. It aims to identify biomedical entities such as diseases, genes, and chemicals in biomedical texts and to label their types. Compared with named entity recognition in general domains, the task remains difficult in the biomedical domain. Biomedical entities such as chemicals, genes, and proteins often consist of long sequences and are both numerous and structurally complex. Moreover, the entities to be recognized commonly involve abbreviations, aliases, nesting, mixed case, polysemy, and other naming irregularities, which makes it difficult for existing work to learn feature representations of biomedical text with only a single deep learning model (e.g., convolutional neural networks, bidirectional gated recurrent units, or attention mechanisms). To address these problems, this thesis presents a systematic study of deep neural network-based biomedical named entity recognition methods. The main contributions are as follows. (1) A biomedical named entity recognition model based on combined feature embedding and multi-task learning (BC_MT_BioNER) is proposed. The model consists mainly of a shared layer and a task-specific layer. The shared layer fuses the contextual word embeddings generated by BioBERT with the character embeddings generated by CharCNN to obtain a vector representation
with both word and character features, which addresses the inadequate extraction of semantic features from biomedical text by existing methods. In addition, a BiGRU with a global attention mechanism is used in the task-specific layer to capture adjacent-character and sentence-level context information, and a CRF finally predicts the sequence labels. The model treats each of the 15 biomedical datasets as an independent task, uses a task-specific module for each, and trains on all datasets jointly so that it acquires features common to the different tasks and generalizes better. Experimental results show that the average F1 score of the BC_MT_BioNER model on 15 commonly used biomedical datasets reaches 85.51%. (2) A deep neural network framework for joint biomedical named entity recognition and normalization (BCBA_GS_BioNER) is proposed. By jointly modeling the recognition and normalization tasks, the framework exploits the interactions between them to reduce error propagation and uses the normalization task to improve recognition accuracy. The framework consists mainly of a recognition module, a query module, and a fusion module. In the recognition module, the pre-trained language model BioBERT replaces traditional static word vectors and dynamically generates contextual word embeddings, which are concatenated with the character embeddings generated by CharCNN and then fed into a BiLSTM to obtain richer semantic information. In the query module, BioBERT generates feature vectors for the standard entities, and an attention mechanism computes the correlation between the biomedical entities in the input text and the standard entities. In the fusion module, the feature
information output by the recognition module and the query module is fused using a gate mechanism, and a Softmax classifier finally outputs the labels of the biomedical entities in the text that correspond to the standard entities. The BCBA_GS_BioNER model was evaluated on the NCBI and BC5CDR datasets, and the results show that it outperforms the comparison model. |
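
The query and fusion modules described in contribution (2) can be illustrated with a minimal sketch. The abstract does not specify the attention scoring function or the form of the gate, so everything below is an assumption made for illustration: dot-product attention of a token vector over standard-entity vectors, and an elementwise sigmoid gate that mixes the recognition feature `h` with the attended entity feature `q`. The names `attend`, `gated_fusion`, `W`, and `b` are hypothetical, and the random vectors stand in for BioBERT and BiLSTM outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def attend(token_vec, entity_vecs):
    """Query-module sketch: dot-product attention of one token over standard entities.

    Returns the attention weights and the weighted sum of the entity vectors.
    The dot-product score is an assumption, not taken from the thesis.
    """
    weights = softmax([dot(token_vec, e) for e in entity_vecs])
    dim = len(token_vec)
    context = [sum(w * e[i] for w, e in zip(weights, entity_vecs)) for i in range(dim)]
    return weights, context


def gated_fusion(h, q, gate_weights, gate_bias):
    """Fusion-module sketch: g = sigmoid(W [h; q] + b), output = g * h + (1 - g) * q.

    This gate form is an assumption; the source only says "gate mechanism".
    """
    joined = h + q  # list concatenation, i.e. [h; q]
    fused = []
    for i in range(len(h)):
        z = dot(gate_weights[i], joined) + gate_bias[i]
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * h[i] + (1.0 - g) * q[i])
    return fused


# Toy inputs: random vectors standing in for learned representations.
rng = random.Random(0)
dim = 4
token = [rng.uniform(-1, 1) for _ in range(dim)]                        # recognition feature
entities = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]  # standard entities
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
b = [0.0] * dim

weights, context = attend(token, entities)
fused = gated_fusion(token, context, W, b)
```

In this sketch each component of `fused` is a convex combination of the corresponding components of `token` and `context`, so the gate decides per dimension how much the entity lookup overrides the recognition feature. In a full model the fused vector would then pass to the Softmax classifier mentioned above.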