With the adjustment of China's family planning policy, the gap between the obstetric services available and the public's demand for them has become increasingly prominent. As medical informatization has progressed, massive numbers of obstetric electronic medical records (EMRs) have accumulated, forming medical big data. Intelligent diagnosis based on EMRs can improve the quality and efficiency of diagnosis and treatment, and thus offers an important way to ease this imbalance between medical supply and demand. The first problem in applying real EMRs to intelligent diagnosis is how to de-identify, clean, and standardize the records so that they can provide data support. In clinical practice, a patient's diagnosis is not a single label: it includes normal diagnoses, pathological diagnoses, complications, and so on. Intelligent diagnosis can therefore be treated as a multi-label classification task over the EMR dataset. The precision of multi-label classification is the key factor determining whether intelligent diagnosis can be applied in clinical practice and improve the quality and efficiency of diagnosis and treatment. How to improve classification performance by integrating the characteristics of EMRs with domain knowledge is the problem that most needs attention. To address these problems, this thesis studies the construction of an obstetric EMR dataset and multi-label classification on that dataset, focusing on methods that improve intelligent diagnosis performance by integrating knowledge at different levels. The main research results of this thesis are as follows:

(1) A dataset for obstetric intelligent diagnosis is constructed through de-identification, data cleaning, diagnosis standardization, and numerical feature extraction. A TT-BiGRU model, which combines text templates (TT) with a Bi-GRU (Bidirectional Gated Recurrent Unit), is proposed to remove the
protected information in EMRs. The precision of TT-BiGRU on Chinese EMRs exceeds 96%, and the model completes the de-identification work with less manual labor. Rule-based methods and semantic similarity calculations are applied to EMR data cleaning, standardization, and numerical feature extraction. Data cleaning removes different types of errors and redundancy in real EMRs. Standardization reduces the diversity of diagnostic labels, cutting their number from 1,640 to 265. The 18 categories of numerical indicators extracted provide richer features for intelligent diagnosis. After these steps, a dataset containing 24,339 obstetric EMRs is formed.

(2) A hierarchical information-enhanced BERT (HIE-BERT) multi-label classification model is proposed for intelligent diagnosis. An obstetric EMR contains both text and numerical data. According to its degree of importance for diagnosis, the text in an EMR can be divided into zero-difference information, basic information, and key information; the numerical information in an EMR is also an important basis for diagnosis. The HIE-BERT model for multi-label classification is built through differentiated processing, normal input together with the introduction of a key information vector, and the enhancement of text features and fusion of numerical features through a multi-head attention enhancement layer. The experimental results show that, compared with traditional multi-label learning methods and other deep learning models, the hierarchical introduction of information and the enhancement of text and numerical features in HIE-BERT effectively improve the performance of intelligent diagnosis. Compared with the BERT model, the average precision on high-frequency diagnosis labels improves by 3.6%, reaching 88.5%.

(3) A Chinese obstetrics knowledge graph (COKG) is constructed and integrated with the HIE-BERT model, forming the KG-HIE-BERT multi-label classification model for intelligent diagnosis. With a
MeSH-like framework as its knowledge ontology, COKG is constructed through semi-automatic and automatic extraction of entities and relationships from medical texts drawn from multiple sources. COKG contains 10,674 entities and 15,281 relationships. Specifically, KG-HIE-BERT is built by fusing COKG into HIE-BERT in three steps: (a) establishing links between EMRs and COKG entities through multi-semantic similarity synthesis, (b) obtaining the candidate set of corresponding diseases from the relationship links, and (c) calculating the weights of diagnostic labels with the symptom-disease comprehensive weight prediction algorithm. The experimental results show that integrating the domain knowledge in COKG improves the average precision of intelligent diagnosis by 3.2% over all EMR diagnosis labels, reaching 88.9%.

(4) With KG-HIE-BERT and COKG at its core, an obstetric intelligent diagnosis system named "XuanBei" is developed. It provides data processing, EMR quality control, intelligent diagnosis, similar-EMR recommendation, and disease-related knowledge querying. It has been applied to the internship and training of postgraduates in a maternity and childcare hospital, where it has received good feedback.
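
As a rough illustration of the three fusion steps (a) to (c) under result (3), the sketch below links EMR symptom mentions to knowledge-graph entities, collects candidate diseases, and assigns each a normalized weight. It is not the thesis's algorithm: the toy graph, its edge weights, the `threshold` value, and all function names are hypothetical, and plain character-level string similarity from Python's `difflib` stands in for the multi-semantic similarity synthesis and for the symptom-disease comprehensive weight prediction algorithm, neither of which is specified here.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical toy graph: symptom entity -> {disease: edge weight}.
# Entries and weights are invented for illustration; they are not taken from COKG.
TOY_KG = {
    "high blood pressure": {"gestational hypertension": 0.8, "preeclampsia": 0.6},
    "proteinuria": {"preeclampsia": 0.9},
    "elevated blood glucose": {"gestational diabetes": 0.9},
}


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entities(mentions, kg, threshold=0.6):
    """Step (a): map each EMR mention to its most similar graph entity, if similar enough."""
    links = {}
    for mention in mentions:
        best = max(kg, key=lambda entity: similarity(mention, entity))
        score = similarity(mention, best)
        if score >= threshold:
            links[mention] = (best, score)
    return links


def score_diseases(links, kg):
    """Steps (b) and (c): gather candidate diseases and give each a normalized weight."""
    raw = defaultdict(float)
    for entity, sim in links.values():
        for disease, edge_weight in kg[entity].items():
            # Assumed scoring rule: link similarity times edge weight, summed per disease.
            raw[disease] += sim * edge_weight
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()} if total else {}


mentions = ["High blood pressure", "proteinuria ++", "headache"]
links = link_entities(mentions, TOY_KG)
weights = score_diseases(links, TOY_KG)
for disease, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{disease}: {weight:.2f}")
```

In this toy run, "headache" matches no entity closely enough and is dropped, while the two linked symptoms both point to preeclampsia, which therefore receives the largest weight. In a full system, such weights would presumably be combined with the classifier's label scores, but how KG-HIE-BERT does so is not described in this abstract.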