Clinical medical records are an important type of data that contain valuable and detailed patient information for clinical analysis. In recent years, natural language processing in the medical domain has become an active research direction in biomedical informatics. However, Chinese clinical records usually exist in the form of semi-structured text, which poses difficulties for medical research. Therefore, we need to extract information from medical records to form structured data and make effective use of clinical text. At present, a large number of methods based on deep learning and pre-training have emerged and have achieved excellent results in biomedical named entity recognition, but research on multi-task learning in this area is still scarce. Hence, we combine natural language processing methods to construct datasets in a human-machine cooperative manner, implement a fine-grained phenotypic named entity recognition method, and explore the practice of multi-task learning in biomedical named entity recognition. This paper describes the work from the following three aspects:

(1) In view of the lack of existing Chinese standard datasets, this paper constructs a framework for human-machine cooperative phenotypic spectrum annotation, which integrates unsupervised entity extraction, entity-matching pre-labeling, pre-labeling with a homologous entity extraction model, and core sample screening based on a greedy algorithm. Based on this framework, we have built four standard datasets: TCM-HN, COVID-19, TCM-SX and TCM-HB. They contain a total of 76,581 medical records and 1,675,200 labeled entities, covering more than 10 entity types such as negated symptom, presented symptom and disease. Statistics of the labeling results show that 80% of the entities are labeled by machine and that the manual review stage accounts for only about 40% of the workload, indicating that human-machine cooperative labeling greatly reduces the manual annotation effort.

(2) Based on the TCM-HN and COVID-19 data, we propose a fine-grained phenotypic named entity recognition method, Phenonizer, which utilizes BERT to obtain character-level global contextual representations, extracts local contextual features with a BiLSTM, and captures the dependencies between entity tags through a CRF. The results on the COVID-19 dataset show that Phenonizer outperforms methods based on random embeddings, GloVe and Word2Vec, with an F1-score of 0.8960. By comparing character embeddings from different fields, we find that character embeddings trained on medical corpora improve the F1-score by 0.0103. In addition, we evaluated Phenonizer on datasets of two granularities and show that our fine-grained dataset slightly boosts the F1-score of the methods, by about 0.005. Furthermore, the fine-grained dataset enables methods to distinguish between negated symptoms and presented symptoms. Finally, we tested the generalization performance of Phenonizer, which achieves an F1-score of 0.8389, and fine-tuned it by fusing a small part of the COVID-19 dataset, raising the F1-score to 0.9097. These results show that Phenonizer is a feasible method that extracts symptom information effectively and generalizes well.
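For concreteness, the following is a minimal sketch of a BERT-BiLSTM-CRF tagger of the kind described above. It assumes PyTorch, the Hugging Face transformers BertModel, and the pytorch-crf package; the class name, hyperparameters and the bert-base-chinese checkpoint are illustrative assumptions, not details taken from the thesis implementation of Phenonizer.

```python
# Minimal sketch of a BERT-BiLSTM-CRF tagger in the spirit of Phenonizer.
# Assumes PyTorch, Hugging Face transformers, and pytorch-crf; names and
# hyperparameters are illustrative, not the thesis implementation.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class BertBiLstmCrfTagger(nn.Module):
    def __init__(self, num_tags: int, bert_name: str = "bert-base-chinese",
                 lstm_hidden: int = 256):
        super().__init__()
        # BERT provides character-level global contextual representations.
        self.bert = BertModel.from_pretrained(bert_name)
        # BiLSTM extracts local contextual features on top of BERT.
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF captures dependencies between adjacent entity tags.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        emissions = self.emission(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the tag sequence under the CRF.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi-decode the most likely tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)
```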
(3) To explore multi-task learning in the field of BioNER, this paper proposes a cascaded multi-task phenotypic entity extraction method, CMTL-NER. Built on W2VMedical-BILSTM-CRF and Phenonizer, CMTL-NER reduces the number of parameters and the training time without affecting model performance. We compared the performance and training time of models with different text lengths. On the CCKS-19 dataset, the F1-score of CMTL-NER is 0.01 higher than that of the single-task models; on the TCM-HN dataset, it is 0.005 higher. In addition, the training time per epoch of CMTL-NER is shorter than that of the single-task models. Finally, the classic multi-task BioNER method MTM-C was adapted to the Chinese datasets and used as the baseline model. On both the TCM-HN and TCM-HB datasets, the F1-score of MTM-C is about 0.02 lower than that of the best Phenonizer-based CMTL-NER, which demonstrates the performance and stability of the CMTL-NER method.
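For illustration, the sketch below shows one common way to realize this kind of parameter sharing: a single encoder shared across tasks with a task-specific emission layer and CRF per dataset. It reuses the assumptions of the previous sketch (PyTorch, transformers, pytorch-crf) and is a generic hard-parameter-sharing example, not the actual CMTL-NER cascade.

```python
# Minimal sketch of hard parameter sharing for multi-task NER: one shared
# encoder, plus a task-specific emission layer and CRF per dataset.
# Hypothetical illustration only, not the CMTL-NER implementation.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF


class SharedEncoderMultiTaskNER(nn.Module):
    def __init__(self, num_tags_per_task: dict,
                 bert_name: str = "bert-base-chinese", lstm_hidden: int = 256):
        super().__init__()
        # Shared layers: stored once and updated by every task.
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Task-specific heads: a small linear layer and CRF per dataset.
        self.emission = nn.ModuleDict({
            task: nn.Linear(2 * lstm_hidden, n)
            for task, n in num_tags_per_task.items()
        })
        self.crf = nn.ModuleDict({
            task: CRF(n, batch_first=True)
            for task, n in num_tags_per_task.items()
        })

    def forward(self, task, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        emissions = self.emission[task](lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf[task](emissions, tags, mask=mask, reduction="mean")
        return self.crf[task].decode(emissions, mask=mask)
```

In a setup like this, training would alternate mini-batches from the different datasets (for example TCM-HN and CCKS-19), so the shared encoder parameters are stored and updated only once across tasks, which is the source of the parameter and training-time savings described above.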