| Stroke, an acute cerebrovascular accident, is one of the leading causes of death and disability worldwide. Clinical trials play an important role in advancing the prevention, treatment, and rehabilitation of stroke, and are indispensable to the study of new drugs and other stroke interventions. However, stroke clinical trials suffer from low recruitment efficiency and too few subjects, which has become their main obstacle. Rapid and effective cohort identification is therefore key to solving this problem, and research on cohort identification based on Electronic Medical Records (EMRs) offers a new approach. Existing studies, however, focus either on extracting semantic information from clinical trial eligibility criteria (EC) or on phenotyping based on EMRs; cohort identification studies for stroke clinical trials in particular are still lacking. This study therefore proposes an end-to-end, systematic cohort identification method for the scenario of identifying acute stroke clinical trial cohorts, which combines EC and EMRs using deep learning. The main work of this research is as follows. This study systematically reviewed existing research methods: we surveyed the available clinical trial EC datasets, EMR-based clinical trial cohort identification methods, semantic information extraction and representation of EC, and EMR-based stroke phenotyping, and then analyzed the advantages and disadvantages of the existing work. Cohort identification for acute stroke clinical trials has distinctive characteristics: both the time limit for condition assessment and the inclusion time window are relatively short, and imaging diagnosis plays an important role. Based on these characteristics, this study modeled the cohort identification problem as follows: the
patient cohort C that meets clinical trial T is identified from the imaging report set R by fitting functions F(T) and G(R), where F(T) represents the stroke types extracted from T and G(R) represents the stroke phenotyping in R. On this basis, we proposed the BERT-based TextCNN models ecBERT-TextCNN and imagingBERT-TextCNN. These models build the domain-specific language models ecBERT and imagingBERT for clinical trial EC and imaging reports, respectively, and apply TextCNN to classify stroke types in EC and EMRs. We then used the HL7 V3 exchange standard from the medical field to express and exchange cohort queries and results. In addition, several baseline models, such as ecGlove-TextCNN, enBERT-TextCNN, imagingGlove-TextCNN, and zhBERT-TextCNN, were constructed for performance comparison. The result is a systematic cohort identification method that analyzes the inherent characteristics of the application scenario, builds a model that is theoretically grounded and fits the scenario in practice, and achieves end-to-end interaction between clinical trial EC and EMR data. For the EC dataset, we obtained 2,742 stroke clinical trials from ClinicalTrials.gov and used a total of 351,337 samples from this platform as the corpus to train the EC-domain language model ecBERT. For the EMR dataset, we collected 14,504 imaging reports and 6,671 discharge diagnosis records from the front pages of medical records, and used a total of 368,255 imaging reports as the corpus to train the language model imagingBERT. For model evaluation, we used overall accuracy and the weighted macro-average F1 Score to assess EC semantic information extraction and stroke phenotyping. The BERT-based models outperformed the baselines. The ecBERT-TextCNN and imagingBERT-TextCNN models achieved the best performance on their respective tasks. The overall accuracy was
0.9175 and 0.9096, and the weighted macro-average F1 Score was 0.9087 and 0.8974, respectively. For the case study, we constructed two independent datasets: External_EC_Dataset, containing 39 stroke clinical trials, and External_Report_Dataset, containing 400 imaging reports from hospitals H01 and H02. The accuracy of the ecBERT model on External_EC_Dataset was 0.8974, and the accuracies of the imagingBERT-TextCNN model on H01 and H02 were 0.8350 and 0.8700, respectively. In summary, this research targeted clinical trials in the stroke domain and constructed a systematic method covering EC semantic information extraction and stroke phenotyping. By drawing on deep learning and common exchange standards in the medical field, it achieved high performance on both the test dataset and the external datasets. This study therefore helps improve the efficiency of cohort identification for stroke clinical trials and ultimately promotes stroke research. |
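
The problem formulation in the abstract (identify cohort C for trial T from report set R using F(T) and G(R)) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `ImagingReport` type, the `identify_cohort` helper, the label names `ischemic`, `hemorrhagic`, and `other`, and the keyword-based stand-ins `toy_f` and `toy_g`, which only take the place of the trained ecBERT-TextCNN and imagingBERT-TextCNN classifiers and are not the study's models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class ImagingReport:
    """Hypothetical container for one imaging report in the set R."""
    patient_id: str
    text: str


def identify_cohort(
    trial_text: str,
    reports: Iterable[ImagingReport],
    f: Callable[[str], Set[str]],
    g: Callable[[str], str],
) -> Set[str]:
    """Return the cohort C: patients whose phenotype G(r) is in F(T)."""
    target_types = f(trial_text)  # F(T): stroke types required by the trial
    return {r.patient_id for r in reports if g(r.text) in target_types}


# Keyword stand-ins for the two classifiers (assumptions, not the real models).
def toy_f(trial_text: str) -> Set[str]:
    text = trial_text.lower()
    return {label for label in ("ischemic", "hemorrhagic") if label in text}


def toy_g(report_text: str) -> str:
    text = report_text.lower()
    if "hemorrhage" in text:
        return "hemorrhagic"
    if "infarct" in text:
        return "ischemic"
    return "other"


reports = [
    ImagingReport("p1", "Acute infarct in the left MCA territory."),
    ImagingReport("p2", "Intracerebral hemorrhage in the right basal ganglia."),
    ImagingReport("p3", "No acute intracranial abnormality."),
]
trial = "Adults with acute ischemic stroke within 24 hours of onset."
cohort = identify_cohort(trial, reports, toy_f, toy_g)
print(sorted(cohort))  # ['p1']
```

Passing F and G in as callables keeps the matching step independent of how the stroke types are predicted, so the keyword stand-ins could be swapped for trained classifiers without changing `identify_cohort`.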