| Electronic medical records (EMRs) are digitized records written by medical staff to document an individual patient's medical activities. They are the alternative to traditional paper-based medical records. EMRs contain comprehensive, informative, professional, real-time, and accurate descriptions of individual patients' health, which makes them a very valuable knowledge resource. Through analysis and mining of EMRs, we can derive a great deal of medical knowledge that is closely related to patients. This knowledge can be used to build clinical decision support systems and to provide personalized health information services. EMRs are not fully structured data: unstructured free text occupies an important position in them. Thus, word segmentation, named entity recognition, and other natural language processing technologies will play an important role in EMR data mining. The most effective word segmentation and named entity recognition approaches are based on dictionaries or supervised machine learning. However, because EMRs are highly specialized, constructing specialized dictionaries or training corpora is extremely difficult. To overcome the difficulty of obtaining training corpora, this paper proposes EMR word segmentation and named entity mining methods based on semi-supervised learning. The large number of unknown words, usually medical jargon and abbreviations, is the greatest challenge for Chinese EMR word segmentation. This paper divides EMR word segmentation into two steps. First, we use a general-domain lexicon to generate an initial segmentation. To deal with the ambiguity problem, we build a probabilistic model in which the word probabilities are estimated by an EM procedure. Then we use the left and right branching entropy to build a goodness measure and treat the recognition of unknown words as an optimization problem that can be solved by dynamic programming.
Experimental results show that the method is feasible, has a strong ability to identify unknown words, and outperforms unsupervised segmentation based on entropy boundaries. Compared with open-domain text, Chinese EMRs differ in many ways. EMRs organize their parts in a semi-structured way, and the language in EMRs contains many significant patterns. To exploit these features, we propose a divide-and-conquer strategy: we use text patterns to extract different types of entities from different parts of the content. The patterns can be learned by a bootstrapping algorithm from a large unlabeled corpus using a small number of labeled entities. The experimental results show that our method is effective when extracting diseases from EMRs. However, it needs further improvement when extracting treatments and drugs. |
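
To make the second segmentation step more concrete, the following is a minimal sketch of left and right branching entropy and of a dynamic-programming search over segmentations. It is not the implementation used in this work: the function names `branching_entropies` and `segment`, the `max_len` limit, the boundary markers `^` and `$`, and the idea of simply summing a per-piece score are all assumptions made for illustration. The abstract does not specify how the goodness measure is built from the two entropies, so the scoring function is left as a parameter.

```python
import math
from collections import Counter, defaultdict


def branching_entropies(corpus, max_len=4):
    """Return {ngram: (left_entropy, right_entropy)} for n-grams up to max_len.

    The left (right) branching entropy of a string is the entropy of the
    distribution of characters seen immediately before (after) it.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sent in corpus:
        padded = "^" + sent + "$"  # hypothetical boundary markers
        for i in range(1, len(padded) - 1):
            for n in range(1, max_len + 1):
                j = i + n
                if j > len(padded) - 1:
                    break
                gram = padded[i:j]
                left[gram][padded[i - 1]] += 1
                right[gram][padded[j]] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return {g: (entropy(left[g]), entropy(right[g])) for g in left}


def segment(text, score, max_len=4):
    """Split text so that the summed score of its pieces is maximal.

    `score` is any function mapping a candidate word to a number; it
    stands in for the goodness measure, whose exact form is assumed here.
    """
    best = [0.0] + [-math.inf] * len(text)  # best[k]: best score of text[:k]
    back = [0] * (len(text) + 1)            # back[k]: start of the last piece
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + score(text[start:end])
            if cand > best[end]:
                best[end], back[end] = cand, start
    pieces, end = [], len(text)
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]
```

A string that is followed and preceded by many different characters has high branching entropy on both sides, which is the usual signal that it forms a free-standing word. One plausible (assumed) goodness measure would combine the two entropies, for example by taking their minimum, and pass it to `segment` as `score`.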