| Named entity recognition and relation extraction are core tasks of information extraction. They aim to extract entities, and the types of semantic relations between them, from unstructured and semi-structured text. Accurate extraction results provide a sound data foundation for knowledge graph construction, information retrieval, and intelligent question answering systems. The literature of traditional Chinese medicine (TCM), as the carrier through which TCM knowledge is transmitted, contains a large amount of unstructured or semi-structured information, such as syndrome types, prescriptions, Chinese medicines, etiology, pathogenesis, and treatment methods, and the need to extract information from these data is increasingly urgent. The traditional approach handles the two tasks as a two-step pipeline: entity recognition is performed first, and relation extraction is then performed on its output. Although the pipeline approach is flexible and simple in terms of model selection and experimental operation, it has three problems: (1) it leads to error accumulation; (2) it ignores the correlation between the two subtasks; (3) it produces redundant information. Joint extraction methods were proposed to overcome these problems; by taking the correlation between the two tasks fully into account, they improve performance on both. However, existing joint extraction methods face problems of their own: (1) they cannot handle overlapping entities; (2) they rely on manually labeled corpora, which consumes manpower and material resources, and they make poor use of the corpus. In view of these problems, the main contributions of this article are as follows. (1) Considering the domain-specific characteristics of TCM texts, the article adopts an improved sequence tagging strategy and constructs a TCM corpus for joint extraction of entities and relations, which provides high-quality annotated data for the task. (2) The article proposes a joint extraction method for TCM entities and relations. Character embeddings and word embeddings are concatenated and fed into a bidirectional LSTM-CRF model, which draws on the feature extraction capability of the bidirectional LSTM and the strengths of CRF in sequence labeling. Combined with optimized extraction rules, the model achieves joint extraction of TCM entities and relations. This method not only overcomes the shortcomings of the traditional pipeline approach but also largely alleviates the entity overlap problem. (3) The article proposes a joint entity and relation extraction method that integrates data augmentation and an attention mechanism. EDA is used to augment the TCM corpus, and the original data set and the predicted pseudo-labeled data are learned jointly through self-training to address the shortage of labeled data. The method adapts well to joint entity and relation extraction on the TCM corpus. (4) On the basis of the above research, the article designs and builds a joint entity and relation extraction system for TCM texts. |
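
As an illustration of the data augmentation step in contribution (3), the following is a minimal sketch of two EDA-style operations, random swap and random deletion. It assumes that "EDA" here refers to the Easy Data Augmentation family of token-level edits. The function names, parameters, and the idea of passing (token, tag) pairs so that labels travel with their tokens are assumptions made for illustration; the abstract does not say which EDA operations were used or how tag sequences were kept consistent.

```python
import random


def random_swap(items, n_swaps, rng):
    """Return a copy of items with n_swaps random position pairs exchanged."""
    out = list(items)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(items, p, rng):
    """Drop each item independently with probability p.

    At least one item is always kept so the sample never becomes empty.
    """
    if len(items) <= 1:
        return list(items)
    kept = [x for x in items if rng.random() >= p]
    return kept if kept else [rng.choice(items)]


# Hypothetical usage: items may be plain tokens or (token, tag) pairs.
rng = random.Random(0)  # fixed seed so augmentation is reproducible
sample = [("tok1", "O"), ("tok2", "B-ENT"), ("tok3", "I-ENT"), ("tok4", "O")]
augmented = [random_swap(sample, 1, rng), random_deletion(sample, 0.1, rng)]
```

Swapping or deleting inside an entity span can invalidate a tag sequence, so a practical pipeline for sequence tagging would likely restrict these edits to positions outside annotated spans; that restriction is left out of the sketch.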