| The rapid growth in the volume of biomedical literature makes biomedical text informatization increasingly important for related research. Information extraction from biomedical literature is crucial for text informatization and is a prerequisite for applications such as information retrieval, question answering, and knowledge base construction. However, the specialized nature of biomedical literature severely restricts the efficiency of information extraction, which has become one of the main difficulties in processing biomedical text. In particular, biomedical literature contains a wealth of entity types, such as diseases, drugs, and genes. Entity names are morphologically diverse and irregular, appearing as abbreviations, aliases, homonyms, and synonyms, which makes biomedical named entity recognition more challenging. To handle diverse or irregular entity names, researchers have proposed the entity normalization task, which maps entity names to their corresponding normalized names in a controlled vocabulary. Although much work has been done on named entity recognition and normalization, these studies still face three major problems. (i) The scale of manually annotated corpora is limited. Machine learning models, especially deep learning models, perform relatively poorly when training data are insufficient. (ii) Entity normalization models struggle to extract effective features, since biomedical entity names involve different levels of semantic information, such as characters, words, and sentence context. (iii) Most studies treat named entity recognition and normalization as two separate tasks. However, such a pipeline model can easily lead to error propagation and cannot capture the interaction between these two closely related tasks. In view of these problems, this paper systematically studies entity recognition and normalization from the aspects of knowledge fusion, feature analysis, and joint models. The main 
contributions of this paper can be summarized as follows: (1) The lack of human-annotated data has been one of the main obstacles for neural named entity recognition in the biomedical domain. To alleviate this problem, this paper introduces a named entity recognition model based on biomedical dictionaries and graph attention networks. First, dictionaries are used to extract entity mention candidates with a graph matching algorithm, which can capture the word patterns of domain entities. Then, a word-mention interactive graph is leveraged to integrate semantic and boundary information. Finally, a graph attention mechanism is used to reduce the noise in the entity candidate information. Experimental results on the chemical and disease data sets (BC5CDR and NCBI-disease) show that the proposed model outperforms state-of-the-art models such as BANNER, CollaboNet, MTM-CW, and AutoNER. (2) The feature representation of biomedical entities is difficult to learn effectively. This paper proposes a neural network model that extracts features at different levels (characters, words, and sentence context) using different neural network structures. We investigate the influence of different neural networks on entity normalization by fusing three widely used architectures (CNN, RNN, and attention). Experimental results on two benchmark data sets (BC5CDR and NCBI-disease) show that the fusion-based model improves the performance of entity normalization. (3) To address the error propagation caused by performing entity recognition before normalization, and the inability of pipelines to exploit the interaction between the two tasks, this paper introduces a joint model for entity recognition and normalization that uses a structured perceptron, beam search, and a transition strategy to model the two tasks jointly. Moreover, rich linguistic and biomedical features are introduced to fit the entity recognition and normalization tasks. Experimental 
results on two benchmark data sets (BC5CDR and NCBI-disease) show that the joint model performs better than the pipeline model. (4) Because the structured perceptron model relies on feature engineering, this paper also proposes a joint model for entity recognition and normalization based on a graph neural network. The model first takes the word representations provided by BioBERT as input and encodes them with a bidirectional long short-term memory (BiLSTM) network to incorporate contextual information. Then, a span-term graph is used to incorporate the rich information in dictionary resources. Finally, two classifiers recognize and normalize named entities simultaneously. Experimental results on two benchmark data sets (BC5CDR and NCBI-disease) show that the deep-learning-based joint model is more effective than the feature-engineering-based joint model. |
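
To make the first step of contribution (1) more concrete, the sketch below shows one simple way to extract entity mention candidates from a dictionary. It is an illustration only: it uses a plain token-level trie lookup rather than the graph matching algorithm of the paper, and the dictionary entries, the concept identifiers (`TOY:…`), and the function names `build_trie` and `find_candidates` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Sentinel key marking the end of a dictionary name inside the trie.
# Using an object (not a string) avoids clashes with real tokens.
_END = object()


def build_trie(dictionary: Dict[str, str]) -> dict:
    """Build a token-level trie from {entity name: concept ID}."""
    root: dict = {}
    for name, concept_id in dictionary.items():
        node = root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node[_END] = concept_id
    return root


def find_candidates(tokens: List[str], trie: dict) -> List[Tuple[int, int, str]]:
    """Return every (start, end, concept_id) span that matches a dictionary name.

    `end` is exclusive. Overlapping and nested matches are all kept.
    """
    candidates = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end].lower())
            if node is None:
                break  # no dictionary name continues with this token
            if _END in node:
                candidates.append((start, end + 1, node[_END]))
    return candidates


# Hypothetical toy dictionary; the IDs are placeholders, not real vocabulary IDs.
toy_dictionary = {
    "breast cancer": "TOY:0001",
    "cancer": "TOY:0002",
    "tamoxifen": "TOY:0003",
}

tokens = "Tamoxifen is used to treat breast cancer".split()
candidates = find_candidates(tokens, build_trie(toy_dictionary))
for start, end, concept_id in candidates:
    print(" ".join(tokens[start:end]), "->", concept_id)
```

In this toy sentence, both "breast cancer" and the nested "cancer" are returned as candidates. Noisy, overlapping candidates of this kind are what a downstream model, such as the graph attention mechanism described above, would need to weigh against the sentence context.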