Named Entities Recognition Based On Recurrent Neural Network In Biomedical Literatures

Posted on: 2017-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: L K Jin
GTID: 2348330488458748
Subject: Computer application technology
Abstract/Summary:
In the biomedical field, recognizing different types of entities is the first step in a number of information extraction tasks such as relation extraction, text classification, coreference resolution and event extraction. Existing methods depend heavily on rich domain expert knowledge and a large number of hand-crafted features. This paper mainly adopts pre-trained word embeddings and recurrent neural networks to build a simple and effective named entity recognition system, with a series of extensions and improvements. Both performance and generalization across different corpora are greatly improved.

First, starting from the conventional Recurrent Neural Network (RNN), recurrent connections are added to both the hidden layer and the output layer. The hidden layer can thus maintain historical information, and the output layer can exploit the probabilistic information from the previous state. In addition, to address the incomplete information caused by splitting text into subsequences, the Brown clustering algorithm and Latent Dirichlet Allocation (LDA) are used to provide a contextual vector associated with each word, which models a wider range of semantic information. Two unidirectional RNNs running in opposite directions are then combined for biomedical named entity recognition. The F-score reaches 83.62% on the BioCreative II GM corpus.

Second, to further improve named entity recognition performance and to overcome the vanishing gradient problem that conventional RNNs face on long sentences, Long Short-Term Memory (LSTM) units are introduced into the recurrent neural network, and a bidirectional recurrent neural network with LSTM units is built. Because fine-tuning changes the pre-trained word embeddings, which contain rich syntactic and semantic information, this paper uses two different word embeddings to extend the LSTM. In addition, sentence vectors are obtained from the difference between the two kinds of word embeddings. Finally, a Sentence vector/Twin word embeddings conditioned Bidirectional LSTM (ST-BLSTM) is constructed for named entity recognition. On the BioCreative II GM corpus, this framework achieves an F-score of 88.61%, which is 1.40 percentage points higher than the top contest system, which combined dictionaries and multiple classifiers.

In summary, this paper adopts two different recurrent neural networks for named entity recognition, avoiding the cost of hand-crafted features. The ST-BLSTM model achieves better recognition performance and generalization. On the BioCreative II GM corpus, its F-score is 4.99 percentage points higher than that of the conventional RNN, and 1.33 percentage points higher than that of a single recognition system with hand-crafted features.
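The first model described in the abstract, an RNN with recurrence in both the hidden and the output layer, run in two directions, can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact equations, so everything here is an assumption made for illustration: the update rules h_t = sigmoid(W_xh x_t + W_hh h_{t-1}) and y_t = softmax(W_hy h_t + W_yy y_{t-1}), the averaging of the two directions, the random untrained weights, and all function and variable names. The Brown clustering and LDA context vectors are not modeled.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def add(a, b):
    """Element-wise sum of two vectors of equal length."""
    return [x + y for x, y in zip(a, b)]


def sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-x)) for x in vec]


def softmax(vec):
    m = max(vec)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(n_in, n_hidden, n_labels, rng):
    """Hypothetical parameter set: (W_xh, W_hh, W_hy, W_yy)."""
    return (
        rand_matrix(n_hidden, n_in, rng),
        rand_matrix(n_hidden, n_hidden, rng),
        rand_matrix(n_labels, n_hidden, rng),
        rand_matrix(n_labels, n_labels, rng),
    )


def run_rnn(inputs, params, n_hidden, n_labels):
    """Unidirectional pass with hidden-layer and output-layer recurrence."""
    W_xh, W_hh, W_hy, W_yy = params
    h = [0.0] * n_hidden   # previous hidden state
    y = [0.0] * n_labels   # previous output distribution
    outputs = []
    for x in inputs:
        # Hidden layer keeps a summary of the history.
        h = sigmoid(add(matvec(W_xh, x), matvec(W_hh, h)))
        # Output layer also sees the previous label probabilities.
        y = softmax(add(matvec(W_hy, h), matvec(W_yy, y)))
        outputs.append(y)
    return outputs


def run_bidirectional(inputs, fwd_params, bwd_params, n_hidden, n_labels):
    """Combine a left-to-right and a right-to-left pass by averaging
    the per-token label distributions (an assumed combination rule)."""
    fwd = run_rnn(inputs, fwd_params, n_hidden, n_labels)
    bwd = run_rnn(inputs[::-1], bwd_params, n_hidden, n_labels)[::-1]
    return [[(a + b) / 2.0 for a, b in zip(f, g)] for f, g in zip(fwd, bwd)]


# Toy usage: a 6-token "sentence" of 4-dimensional word vectors, 3 labels.
rng = random.Random(0)
n_in, n_hidden, n_labels = 4, 5, 3
sentence = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(6)]
probs = run_bidirectional(
    sentence,
    init_params(n_in, n_hidden, n_labels, rng),
    init_params(n_in, n_hidden, n_labels, rng),
    n_hidden,
    n_labels,
)
```

Each entry of `probs` is a probability distribution over the labels for one token. In a real tagger the three labels might stand for a B/I/O scheme, and the weights would be trained rather than drawn at random.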
Keywords/Search Tags:Named Entities Recognition, Word Embedding, Recurrent Neural Network, Sentence Vector, Long Short-Term Memory