| With the rapid development of the life sciences, the pharmaceutical literature is growing exponentially. Extracting structured, organized information about compounds from this massive body of unstructured medical literature would help researchers in pharmaceuticals and related fields carry out their studies, and in turn promote technological innovation in the drug industry. Chemical named entities, the main carriers of information in literature analysis, attract considerable attention from professionals, so recognizing them has become an important research topic. Among existing named entity recognition (NER) methods, the long short-term memory network with a conditional random field layer (LSTM-CRF) is one of the most advanced and most commonly deployed approaches. However, this supervised learning method usually requires a large labeled corpus, and labeled data are very limited in some professional fields, such as the drug patents studied in this paper. In such cases, the supervised model cannot tag the corresponding entities accurately. To overcome this shortcoming, this study proposes a semi-supervised named entity recognition approach that combines a bidirectional long short-term memory network, word similarity, and a conditional random field layer (BiLSTM-WS-CRF). First, the vector representations of the words belonging to each label are clustered, and the resulting cluster centers are taken as representatives of that label; an appropriate similarity measure is then selected to quantify the relationship between each input word and the different labels, yielding a corresponding vector representation. Next, this vector is combined with the hidden-layer output of the BiLSTM to calculate a confidence score. Finally, the score is fed to the CRF layer to obtain predicted tags that conform to the tagging scheme. In this way, the
proposed model not only introduces unsupervised learning characteristics to guide the tagging process, but also preserves the advantages of the supervised BiLSTM-CRF model, which accounts for both long- and short-term dependencies within the input sequence and dependencies between labels. Experiments show that, compared with the traditional baseline model and other commonly deployed semi-supervised methods, the proposed method has clear advantages in named entity recognition tasks in pharmaceutical and other professional fields. To help researchers read and analyze the literature, this study further designs a software system for named entity recognition on drug patents, which implements a series of functions such as text processing, word vector training, named entity recognition, and entity visualization. The system provides supporting information for medical research and can help accelerate the drug development process. |
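
The word-similarity step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the clustering algorithm, the similarity measure, or how the similarity vector is combined with the BiLSTM output, so everything specific here is an assumption made for illustration: plain k-means, cosine similarity, taking the maximum over a label's cluster centers, and concatenation followed by a linear layer. All function names (`kmeans_centers`, `label_similarity_vector`, `emission_scores`) are hypothetical.

```python
import numpy as np

def kmeans_centers(vectors, k, iters=20):
    """Plain Lloyd's k-means. The first k vectors seed the centers
    (an assumption chosen only to keep the sketch deterministic)."""
    vectors = np.asarray(vectors, dtype=float)
    centers = vectors[:k].copy()
    for _ in range(iters):
        # Assign each vector to its nearest center (squared Euclidean distance).
        dists = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def cosine(u, v, eps=1e-12):
    """Cosine similarity; eps guards against division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def label_similarity_vector(word_vec, label_centers, labels):
    """One entry per label: the highest cosine similarity between the word
    and any of that label's cluster centers (assumed aggregation)."""
    return np.array([
        max(cosine(word_vec, c) for c in label_centers[lab]) for lab in labels
    ])

def emission_scores(hidden, sim_vec, W, b):
    """Assumed combination: concatenate the BiLSTM hidden state with the
    similarity vector, then apply a linear layer to get per-label scores."""
    return W @ np.concatenate([hidden, sim_vec]) + b

# Toy usage with made-up 2-d word vectors and two labels.
labels = ["O", "B-CHEM"]
label_words = {
    "O": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "B-CHEM": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
centers = {lab: kmeans_centers(vecs, k=1) for lab, vecs in label_words.items()}
sim = label_similarity_vector(np.array([0.05, 1.0]), centers, labels)
```

In a full model the per-label scores would be passed to a CRF layer, which is omitted here; the sketch covers only how unlabeled-style similarity information could enter the score.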