With the rapid development of clinical medicine, a growing body of research is published as clinical trial literature, and much of the clinical evidence it contains exists only as unstructured text. Clinical evidence is an important basis for formulating and updating clinical practice guidelines. Automatically extracting basic evidential information, such as drug names, disease names and therapeutic effect indexes, from this unstructured literature therefore plays a crucial role in guideline development. However, because no benchmark data set exists for extracting drug names, disease names and therapeutic effect indexes from clinical trial literature, research on entity extraction methods in this field has progressed slowly. In addition, few systems are designed specifically for entity extraction from clinical trial literature, and few clinical researchers have a strong computing background, which further hinders related research. To address these problems, this paper makes the following three contributions.

(1) A benchmark data set was constructed for extracting drug name, disease name and therapeutic effect index entities from clinical trial literature. First, a total of 223,622 drug clinical trial publications were collected from PubMed, including systematic reviews, meta-analyses and randomized controlled trials. Then, 8,000 abstracts were selected in proportion to the number of publications collected for each of the three types, and their named entities were labeled and reviewed through "human-machine collaboration". The resulting benchmark data set contains 46,578 drug names, 25,559 disease names and 18,970 therapeutic effect indexes.

(2) An entity extraction model, MT-BioKMNER, is proposed. It is based on BioBERT and combines multi-task learning with a key-value memory network. Comparative experiments against CRF, BiLSTM-CRF, BERT and BioBERT were conducted on the constructed data set and on four public data sets (BC5CDR, BioNLP11ID, BC2GM and NCBI-Disease), and the effects of the multi-task learning mechanism and the key-value memory network on the model were analyzed separately. The experiments show that MT-BioKMNER outperforms the other four models on both the constructed data set and the public data sets. On the constructed data set, the average F1 value over the three entity types reached 75.82%, 2.54% higher than that of BioBERT, the best of the other models; the F1 values for drug name, disease name and therapeutic effect index were 81.72%, 68.58% and 77.17%, respectively. The analysis of multi-task learning and the key-value memory network also confirmed that both mechanisms improve model performance.

(3) A clinical trial literature entity extraction system was developed based on the MT-BioKMNER model. The system has a simple, user-friendly interface that lets researchers browse and retrieve the collected clinical trial literature and automatically extract drug names, disease names and therapeutic effect indexes from abstract texts.
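
The average F1 value reported above can be checked against the three per-entity scores. The following is a minimal sketch, assuming the average is an unweighted (macro) mean over the three entity types; the abstract does not state the averaging scheme, but the reported figures are consistent with it. The names `reported_f1` and `macro_average` are illustrative and not taken from the paper.

```python
# Per-entity F1 scores (in percent) as reported in the abstract.
reported_f1 = {
    "drug name": 81.72,
    "disease name": 68.58,
    "therapeutic effect index": 77.17,
}


def macro_average(scores):
    """Return the unweighted mean of per-class scores (assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


macro_f1 = macro_average(reported_f1)
print(f"Macro-averaged F1: {macro_f1:.2f}%")  # Macro-averaged F1: 75.82%
```

A weighted average (for example, by the number of entities per type) would generally give a different value, so the match here only suggests, rather than proves, that macro averaging was used.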