| Atrial fibrillation (AF) is one of the most common arrhythmias and has a complex etiology. In some patients, asymptomatic or paroxysmal atrial fibrillation is difficult to predict, which makes diagnosis and treatment extremely difficult. With the rapid development of medical information technology, the volume of data in the medical field has grown rapidly. The vast amount of medical data on atrial fibrillation contains a great deal of medical knowledge about the diseases and symptoms that patients suffer from. How to process, mine, analyze, and manage large-scale medical text data to support information retrieval and health knowledge services has become an urgent requirement. To address these issues, this project draws on multi-source knowledge from encyclopedic websites, literature, and textbooks, and uses natural language processing techniques together with manual annotation to extract knowledge from medical texts. Semi-structured and unstructured knowledge is converted into structured knowledge to build a knowledge graph of atrial fibrillation based on multi-source data. The main work of this thesis is as follows: (1) For the named entity recognition task, a pre-training-based RoBERTa-BiLSTM-CRF model is proposed to address the problem of polysemy in Chinese entities. First, RoBERTa-WWM uses Whole Word Masking to acquire semantic features dynamically in a way suited to Chinese text, taking full advantage of the pre-trained model. Then a BiLSTM learns contextual feature information. Finally, a CRF learns the dependencies among adjacent labels in the sequence to complete named entity recognition. Validation on the AF text dataset constructed in this thesis shows that the proposed model outperforms the comparison models in accuracy, recall, and F1 score, which verifies its effectiveness. (2) For the relation extraction task, a BERT_MSD relation extraction model based on pre-training on a Chinese corpus is proposed to address the problem that a small training corpus easily leads to overfitting. The model comprises two modules, a BERT layer and a fully connected layer. First, the BERT module, consisting of 2 Transformer encoder layers, extracts the important features of the text. Then a Multi-Sample Dropout strategy is applied before the fully connected layer to prevent overfitting, and relation classification is performed after the fully connected layer to achieve relation extraction. Validation on the atrial fibrillation text dataset constructed in this thesis shows that the model effectively prevents overfitting, converges faster, and achieves better results. (3) A knowledge graph of atrial fibrillation based on multi-source data is constructed. To address the lack of data in atrial fibrillation research, multi-source medical knowledge data are used, including data from authoritative medical and health websites, encyclopedia websites, authoritative Chinese literature, medical textbooks, and electronic medical records. The data are obtained through web crawlers, query downloads, and hospital electronic medical records, and together they cover the information related to atrial fibrillation comprehensively. First, the entity, attribute, and relationship categories of the AF knowledge graph are determined to form its schema layer. Then knowledge extraction and knowledge fusion are carried out with a combination of manual annotation and automated extraction: 16,350 entities and 15,060 triples were annotated in unstructured text over one year, and the unannotated data were extracted automatically using the two knowledge extraction algorithms proposed in this thesis. Fusing all the data yields 10,186 entities, 12,115 relationship triples, and 22,220 attribute triples. Finally, the Neo4j graph database is used for storage and visualization, which supports an intelligent medical question answering application. |
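
The Multi-Sample Dropout step described in contribution (2) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the dropout rate `p`, the number of samples, and the choice to average logits (rather than per-sample losses) are all assumptions made for illustration, and plain Python lists stand in for the BERT output features.

```python
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]

def linear(vec, weights, bias):
    """Fully connected layer: one output logit per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def multi_sample_dropout_logits(features, weights, bias,
                                p=0.5, num_samples=5, rng=None):
    """Apply several independent dropout masks to the same features,
    pass each through the shared fully connected layer, and average
    the resulting logits (assumed variant; averaging losses is another)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    total = [0.0] * len(bias)
    for _ in range(num_samples):
        logits = linear(dropout(features, p, rng), weights, bias)
        total = [t + l for t, l in zip(total, logits)]
    return [t / num_samples for t in total]

# Hypothetical 3-dimensional sentence feature and a 2-class relation classifier.
features = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]
print(multi_sample_dropout_logits(features, weights, bias))
```

Because every dropout sample shares the same fully connected weights, each training step sees several perturbed views of one input at little extra cost, which is the regularizing effect the abstract attributes to the strategy.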