| Phenotype information of patients in electronic health records (EHRs) is mainly recorded in natural language and cannot be directly used for clinical studies. EHR-based deep phenotyping algorithms can structure patients’ phenotype information in EHRs with high fidelity and have become a focus of medical informatics. Nevertheless, developing a deep phenotyping method for non-English EHRs (such as Chinese EHRs) remains a major challenge. Although EHR resources are abundant in China, data with fine-grained annotation suitable for deep phenotyping are still limited, so such a method must be developed in a low-resource scenario. In this study, we aimed to develop a deep phenotyping method for Chinese EHRs that generalizes well and requires only limited fine-grained annotation data. The key to this methodology was to identify linguistic patterns of phenotype information in Chinese EHRs using a biological sequence motif discovery tool, and then to perform deep phenotyping on Chinese EHRs by linguistic pattern recognition. Specifically, a total of 1,000 Chinese EHRs were manually annotated according to a predefined information model, PhenoSSU (the Semantic Structured Unit of Phenotypes). The whole dataset was randomly divided into a training set (70%) and a test set (30%). Linguistic patterns were learned in three steps. First, the free text of the training set was encoded as a single-letter sequence. Second, a biological sequence analysis tool, the MEME motif discovery tool, was used to discover motifs in that single-letter sequence. Finally, the learned motifs were reduced to a list of regular expressions that represented the linguistic patterns of the PhenoSSU model in the text of Chinese EHRs. Based on the learned linguistic patterns, we developed a deep phenotyping method for Chinese EHRs, comprising a deep learning-based model for entity recognition and a pattern recognition-based method for 
attribute prediction. To demonstrate the potential applications of EHR-based deep phenotyping, we conducted two case studies: exploring real-world evidence that EHR-based deep phenotyping can update knowledge in guidelines, and constructing a machine learning model to predict thyroid-stimulating hormone (TSH) levels based on deep phenotyping of physical examination EHRs. A total of 51 statistically significant sequence motifs were learned from the 700 EHRs in the training set and were then reduced to six regular expressions. A subsequent test showed that the six regular expressions could be learned from 134 (±9.7) Chinese EHRs in the training set. Our deep phenotyping method for Chinese EHRs extracted PhenoSSU instances with an overall accuracy of 0.844 on the test set. For named entity recognition, our algorithm achieved an F1-score of 0.898 with a BERT-BiLSTM-CRF model; for attribute prediction, it achieved a weighted accuracy of 0.940 with our linguistic pattern-based method. Comparing the EHRs of chronic bronchitis with the relevant guideline at the PhenoSSU level suggested that guidelines could be updated based on deep phenotyping of medical records. Moreover, we applied our algorithm to deep phenotyping of a physical examination corpus and then constructed a machine learning model that can predict the TSH level in the physical examination population. In summary, we developed a simple but effective method for deep phenotyping of Chinese EHRs based on a limited fine-grained annotation data set. Our work will promote the secondary use of Chinese EHRs and may inspire similar efforts in other non-English-speaking countries. |
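
The encoding and pattern-matching steps described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the letter alphabet, the category names, the example tokens, and the regular expression are all hypothetical, since the abstract does not list the actual alphabet or the six learned expressions, and the MEME motif discovery step is not reproduced here. The sketch only shows how text encoded with one letter per token lets a regular expression over letters map back to tokens and their attributes.

```python
import re

# Hypothetical single-letter alphabet; the categories and letters used in the
# study are not given in the abstract.
ALPHABET = {
    "phenotype": "P",
    "negation": "N",
    "severity": "S",
    "punctuation": "X",
    "other": "O",
}


def encode(tagged_tokens):
    """Map each (text, category) pair to one letter, one letter per token."""
    return "".join(ALPHABET.get(category, "O") for _, category in tagged_tokens)


# Hypothetical pattern standing in for an expression reduced from motifs:
# an optional negation cue, an optional severity word, then a phenotype entity.
PATTERN = re.compile(r"(N?)(S?)P")


def predict_attributes(tagged_tokens):
    """Match the pattern on the letter sequence and read attributes back."""
    sequence = encode(tagged_tokens)
    results = []
    for match in PATTERN.finditer(sequence):
        # Letter positions equal token positions because of the 1:1 encoding.
        phenotype_index = match.end() - 1
        severity = tagged_tokens[match.start(2)][0] if match.group(2) else None
        results.append({
            "phenotype": tagged_tokens[phenotype_index][0],
            "assertion": "absent" if match.group(1) else "present",
            "severity": severity,
        })
    return results


# Illustrative input: "no cough, mild fever" as pre-tagged Chinese tokens.
example = [
    ("无", "negation"),
    ("咳嗽", "phenotype"),
    ("，", "punctuation"),
    ("轻度", "severity"),
    ("发热", "phenotype"),
]
for item in predict_attributes(example):
    print(item)
```

Keeping the encoding at exactly one letter per token is the design choice that matters here: match offsets in the letter sequence can be used directly as token indices, so no separate alignment step is needed.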