Clinical medical records are an important type of data that contain valuable and detailed patient information for clinical analysis. In recent years, natural language processing in the medical domain has become an active research direction in biomedical informatics. However, Chinese clinical records usually exist in the form of semi-structured text, which poses difficulties for medical research. Therefore, we need to extract information from medical records to form structured data and make effective use of clinical text. At present, a large number of methods based on deep learning and pre-training have emerged and have achieved excellent results in biomedical named entity recognition, but research on multi-task learning in this area is still scarce. Hence, we combine natural language processing methods to construct datasets in a human-machine cooperative manner, implement a fine-grained phenotypic named entity recognition method, and explore the practice of multi-task learning in biomedical named entity recognition. This paper describes the work from the following three aspects:

(1) In view of the lack of existing Chinese standard datasets, this paper constructs a framework for human-machine cooperative phenotypic spectrum annotation, which integrates unsupervised entity extraction, entity-matching pre-labeling, pre-labeling with a homologous entity extraction model, and core sample screening based on a greedy algorithm. Based on this framework, we have built four standard datasets: TCM-HN, COVID-19, TCM-SX and TCM-HB. They contain a total of 76,581 medical records and 1,675,200 labeled entities, covering more than 10 entity types such as negated symptom, presented symptom and disease. Statistics of the labeling results show that 80% of the entities are labeled by machine and that the manual review stage accounts for only about 40% of the workload, indicating that human-machine cooperative labeling greatly reduces the manual annotation effort.

(2) Based on the TCM-HN and COVID-19 data, we propose a fine-grained phenotypic named entity recognition method, Phenonizer, which utilizes BERT to obtain character-level global contextual representations, extracts local contextual features with a BiLSTM, and captures the dependencies between entity tags through a CRF. The results on the COVID-19 dataset show that Phenonizer outperforms methods based on random embeddings, GloVe and Word2Vec, with an F1-score of 0.8960. By comparing character embeddings from different fields, we find that character embeddings trained on medical corpora improve the F1-score by 0.0103. In addition, we evaluated Phenonizer on datasets of two granularities and show that our fine-grained dataset slightly boosts the F1-score of the methods, by about 0.005. Furthermore, the fine-grained dataset enables methods to distinguish between negated symptoms and presented symptoms. Finally, we tested the generalization performance of Phenonizer, which achieves an F1-score of 0.8389, and fine-tuned it by fusing a small part of the COVID-19 dataset, raising the F1-score to 0.9097. These results show that Phenonizer is a feasible method that extracts symptom information effectively and generalizes well.
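For concreteness, the following is a minimal sketch of a BERT-BiLSTM-CRF tagger of the kind described above. It assumes PyTorch, the Hugging Face transformers BertModel, and the pytorch-crf package; the class name, hyperparameters and the bert-base-chinese checkpoint are illustrative assumptions, not details taken from the thesis implementation of Phenonizer.

```python
# Minimal sketch of a BERT-BiLSTM-CRF tagger in the spirit of Phenonizer.
# Assumes PyTorch, Hugging Face transformers, and pytorch-crf; names and
# hyperparameters are illustrative, not the thesis implementation.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class BertBiLstmCrfTagger(nn.Module):
    def __init__(self, num_tags: int, bert_name: str = "bert-base-chinese",
                 lstm_hidden: int = 256):
        super().__init__()
        # BERT provides character-level global contextual representations.
        self.bert = BertModel.from_pretrained(bert_name)
        # BiLSTM extracts local contextual features on top of BERT.
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF captures dependencies between adjacent entity tags.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        emissions = self.emission(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the tag sequence under the CRF.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi-decode the most likely tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)
```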
(3) To explore multi-task learning in the field of BioNER, this paper proposes a cascaded multi-task phenotypic entity extraction method, CMTL-NER. Built on W2VMedical-BILSTM-CRF and Phenonizer, CMTL-NER reduces the number of parameters and the training time without affecting model performance. We compared the performance and training time of models with different text lengths. On the CCKS-19 dataset, the F1-score of CMTL-NER is 0.01 higher than that of the single-task models; on the TCM-HN dataset, it is 0.005 higher. In addition, the training time per epoch of CMTL-NER is shorter than that of the single-task models. Finally, the classic multi-task BioNER method MTM-C was adapted to the Chinese datasets and used as the baseline model. On both the TCM-HN and TCM-HB datasets, the F1-score of MTM-C is about 0.02 lower than that of the best Phenonizer-based CMTL-NER, which demonstrates the performance and stability of the CMTL-NER method.
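For illustration, the sketch below shows one common way to realize this kind of parameter sharing: a single encoder shared across tasks with a task-specific emission layer and CRF per dataset. It reuses the assumptions of the previous sketch (PyTorch, transformers, pytorch-crf) and is a generic hard-parameter-sharing example, not the actual CMTL-NER cascade.

```python
# Minimal sketch of hard parameter sharing for multi-task NER: one shared
# encoder, plus a task-specific emission layer and CRF per dataset.
# Hypothetical illustration only, not the CMTL-NER implementation.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class SharedEncoderMultiTaskNER(nn.Module):
    def __init__(self, num_tags_per_task: dict,
                 bert_name: str = "bert-base-chinese", lstm_hidden: int = 256):
        super().__init__()
        # Shared layers: stored once and updated by every task.
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Task-specific heads: a small linear layer and CRF per dataset.
        self.emission = nn.ModuleDict({
            task: nn.Linear(2 * lstm_hidden, n)
            for task, n in num_tags_per_task.items()
        })
        self.crf = nn.ModuleDict({
            task: CRF(n, batch_first=True)
            for task, n in num_tags_per_task.items()
        })

    def forward(self, task, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        emissions = self.emission[task](lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf[task](emissions, tags, mask=mask, reduction="mean")
        return self.crf[task].decode(emissions, mask=mask)
```

In a setup like this, training would alternate mini-batches from the different datasets (for example TCM-HN and CCKS-19), so the shared encoder parameters are stored and updated only once across tasks, which is the source of the parameter and training-time savings described above.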