| With the development of clinical information technology, a large amount of data has accumulated in the medical field. Electronic medical records (EMRs), one of the most important sources of clinical information, contain rich medical knowledge, and extracting valuable information from them has become a foundation of medical intelligence research. Named entity recognition and relation extraction are the key steps in extracting this information from EMRs and have been studied by many scholars. However, because relevant corpora are scarce and EMR text contains nested entities and overlapping relations, EMR information extraction rarely achieves the desired performance. This thesis therefore takes diabetes EMRs as its starting point: it first constructs an annotated EMR data set and then studies entity recognition and relation extraction on it. The main work is as follows. (1) Construction and analysis of an EMR entity recognition and relation extraction data set. To address the lack of EMR corpora, this thesis starts from the text of diabetes EMRs and, taking ICD-10, I2B2 and the existing medical annotation standards in China as references and accounting for the density characteristics of diabetes EMRs, establishes a fine-grained annotation system. The annotation specification was developed through evaluation by medical experts. The Diabetes Electronic Medical Record entity and relation Corpus (DEMRC) was then built through semi-automatic annotation and multiple rounds of manual proofreading on a distributed annotation platform. DEMRC contains 8,899 entities, 456 entity modifications and 16,564 relations, and the annotation consistency for entities and relations reaches 85.62% and 94.16%, respectively. Experimental analysis and evaluation show that the data set is reliable. (2) This thesis proposes T-RoBERTa-BiLSTM-CRF, a named entity recognition model based on transfer learning. The model first uses the open 
medical evaluation data set CMeEE as source-domain data to train RoBERTa, then uses the updated RoBERTa model to embed the words of the EMR data. The embeddings are fed into a BiLSTM layer to obtain a bidirectional semantic representation, and a CRF layer converts the encoded hidden-layer information into label probabilities. The F1 values on the DEMRC and CCKS 2019 data sets reach 94.38% and 84.91%, respectively. Comparative experiments show that transferring data between different sources within the same field is effective. (3) This thesis proposes RoBERTa-GSI-PM, a multi-head relation extraction model that integrates graph structure information. The model uses RoBERTa to encode the input text in the embedding layer and a GCN to obtain graph structure information from structured medical knowledge; the two are fused to capture more semantic information about the input text. The fused vector is then fed into a pointer network to perform entity recognition, and entity relations are extracted with a multi-head selection mechanism based on the extracted entity vectors. The F1 values on the DEMRC and DiaKG data sets reach 63.53% and 45.22%, respectively. Comparative experiments show that the method is effective for small-scale, complex entity relation extraction. |
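
The multi-head selection step described in (3) can be illustrated with a small sketch. It is not the thesis's implementation: the entity names, relation labels, logits, and the 0.5 threshold below are all hypothetical, and the micro-averaged F1 is only one common way such scores are computed. The point it shows is that each (head, relation, tail) combination is scored independently, so a single entity can take part in several relations, which is how this kind of decoding accommodates overlapping relations.

```python
import math
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation label, tail entity)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def multi_head_select(
    entities: List[str],
    relations: List[str],
    logits: List[List[List[float]]],
    threshold: float = 0.5,
) -> Set[Triple]:
    """Keep every triple whose independent sigmoid score exceeds the threshold.

    logits[i][j][r] is a hypothetical score that entity i is the head of
    relation r with entity j as the tail.
    """
    selected: Set[Triple] = set()
    for i, head in enumerate(entities):
        for j, tail in enumerate(entities):
            if i == j:
                continue  # this sketch ignores self-relations
            for r, label in enumerate(relations):
                if sigmoid(logits[i][j][r]) > threshold:
                    selected.add((head, label, tail))
    return selected


def micro_f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """Micro-averaged F1 over exact-match triples."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy inputs (placeholders, not taken from DEMRC or DiaKG).
entities = ["drug_A", "disease_B", "symptom_C"]
relations = ["treats", "causes"]

NEG = -3.0  # low default logit: "no relation"
logits = [[[NEG, NEG] for _ in entities] for _ in entities]
logits[0][1][0] = 2.0  # drug_A treats disease_B
logits[0][2][1] = 1.5  # drug_A causes symptom_C (drug_A is a head twice)
logits[1][2][1] = 0.3  # borderline score that just passes the 0.5 threshold

gold = {
    ("drug_A", "treats", "disease_B"),
    ("drug_A", "causes", "symptom_C"),
}

predicted = multi_head_select(entities, relations, logits)
print(sorted(predicted))
print(round(micro_f1(predicted, gold), 4))
```

In this toy setting the borderline triple is a false positive, so precision falls while recall stays at 1; raising the threshold to 0.6 would drop it. A real model would learn the logits from the fused entity vectors rather than have them set by hand.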