
CRF-Based Automatic Recognition of Chinese Medicine Terms in the Treatise on Cold Damage

Posted on: 2015-03-02  Degree: Master  Type: Thesis
Country: China  Candidate: H Y Meng  Full Text: PDF
GTID: 2264330428474678  Subject: Basic Theory of TCM
Abstract/Summary:
Ancient works are one of the main sources of the Chinese medicine (CM) knowledge accumulated through clinical practice. Further development and rearrangement of these works may provide material for knowledge discovery. However, their sheer volume, the unique terminology of CM, and archaic expressions unfamiliar to today's readers are major hurdles to their study. With rapid technological progress, more and more information processing methods have been proposed, and some of them may offer better ways of handling the problems facing CM ancient works; this is the direction in which the field is moving. Among them, information extraction techniques use a computer to extract specific information from given texts, structure it, and store it in a database. They are especially productive for reducing huge amounts of information to what is actually needed, and may thus lay a foundation for a structured CM diagnosis and treatment platform.
The identification of terminology is an important step that determines whether extraction is precise, and it is also the foundation of knowledge discovery, machine discovery, question answering, knowledge extraction, information retrieval, text mining, and many other tasks. By comparing several popular terminology identification techniques, this dissertation found that techniques based on statistics and rules, among others, can be used to address the problems of CM ancient works. Four statistical models, the hidden Markov model (HMM), maximum entropy model (MEM), maximum entropy Markov model (MEMM), and conditional random field (CRF), were introduced and analyzed with respect to their suitability for research on CM ancient works. On this basis, CRF was found to be suitable for this research and is therefore described in detail.
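For reference, the linear-chain CRF named above is commonly written in the following textbook form. The notation is an assumption made for illustration and is not taken from the dissertation: x is the character sequence, y the label sequence, f_k are feature functions, and λ_k their learned weights.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because Z(x) normalizes over whole label sequences rather than per position, the model can use overlapping features of the full input, which is one usual argument for preferring CRF over HMM and MEMM in sequence labeling.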
Treatise on Cold Damage (《伤寒论》), one of the four great works of CM that guide clinical practice, was adopted as the object of this study. By identifying and studying the terminology in this work, the aim is to realize automatic identification of CM terminology and to provide a reference for CM informatization.
The purpose of this study: (1) To realize automatic identification of CM terminology from the perspective of CM informatization, contributing to a structured CM diagnosis and treatment platform. (2) Based on the performance of the 4 terminology identification models, to adopt CRF fused with multiple features and seek the feature combination that gives the best model performance. (3) To develop a method that helps medical researchers achieve automatic knowledge discovery in CM works by providing an open source tool for CM researchers.
Method: This study was funded by the National Natural Science Foundation of China under the projects "Research on the recognition mode of CM based on question-answering system" (No. 81273876) and "Research on visualized modeling of the diagnosis and treatment information of CM based on system complexity".
We adopted the Treatise on Cold Damage (Zhao Kaimei edition, published in the Ming dynasty) as the research object and the open source software package CRF++ 0.58 as the tool for implementing CRF. (1) First, the author analyzed the current limitation, namely the lack of segmentation methods and of their application to Chinese text segmentation, and then determined the word segmentation method on the basis of term recognition experiments. (2) The data were then cleaned, the features selected and annotated, and the feature template written. (3) According to the experimental design, the test texts were divided into 4 groups: character + category labels; character + word boundary + category labels; character + word property + category labels; and character + word property + word boundary + category labels. (4) The training text and template were fed into the CRF++ training program, and a model file was obtained. (5) The model file and test text were fed into the CRF++ test program, producing recognition results. (6) The results were evaluated by analyzing the recognition performance of the 4 groups.
Results: (1) Compared with the control group, recognition performance in the experimental groups was greatly improved. (2) Comparing the 2nd and 3rd experiments, the precision, recall, and F values showed that the "word boundary" feature improves model performance more than the "word property" feature. (3) Compared with the other 3 experiments, the 4th, which combined the character, word boundary, word property, and category labels, achieved higher precision, recall, and F value, meaning that it had the best recognition performance.
Conclusion: (1) The experiment used a computer to recognize CM terminology in the Treatise on Cold Damage and obtained comparatively good recognition results. (2) It can be observed from the experimental results that even the best-performing experiment is far inferior to its counterparts in biomedicine and news reports. This may be a result of the special terminology and grammar of the Treatise on Cold Damage. For example, the 3 sentences “发汗吐下后,虚烦不得眠”, “寸口脉浮大,而医反下之” and “脉浮而大,心下反硬” all contain the word “下”, but its meaning is not the same: in the first 2 it denotes a kind of treatment, and in the last it indicates a position. Situations like this may compromise identification accuracy. (3) On the basis of previous research, this dissertation proposed an automatic CM terminology identification approach focusing on the Treatise on Cold Damage. The experiments show that combined features performed better than a single feature, and that the more features are considered, the better the performance that may be achieved. (4) Information extraction technology plays an important role in structured electronic medical records for traditional Chinese medicine and in specialized CM search engines, so this study has practical significance for the development of TCM informatization. (5) Considering the status quo of automatic terminology identification, future research may focus on expanding the training samples to refine more features, completing the statistical processing and template rules, and exploring better-performing models.
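The four feature groups and the precision/recall/F evaluation described above can be sketched as follows. This is a minimal illustration in Python, not the dissertation's code. The tag sets (B/M/E/S for word boundary, B-/I-/O for labels), the category names TRT and SYM, the word-property codes, and the segmentation of the example sentence are all assumptions. The only convention relied on is the CRF++ input layout: one token per row, whitespace-separated columns with the label last, and a blank line between sentences.

```python
from typing import List, Set, Tuple

# A segmented word: (text, word_property, category). "O" means "not a term".
Word = Tuple[str, str, str]

# Hypothetical annotation of a sentence quoted in the abstract.
EXAMPLE: List[Word] = [
    ("发汗", "v", "TRT"), ("吐", "v", "TRT"), ("下", "v", "TRT"),
    ("后", "f", "O"), (",", "w", "O"),
    ("虚烦", "n", "SYM"), ("不得眠", "n", "SYM"),
]


def boundary_tags(word: str) -> List[str]:
    """Word-boundary feature per character: S single, B begin, M middle, E end."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def to_rows(sentence: List[Word], use_boundary: bool, use_property: bool) -> List[str]:
    """Emit one tab-separated row per character, with the label as last column."""
    rows = []
    for word, prop, category in sentence:
        for i, (ch, bnd) in enumerate(zip(word, boundary_tags(word))):
            cols = [ch]
            if use_boundary:
                cols.append(bnd)
            if use_property:
                cols.append(prop)
            if category == "O":
                cols.append("O")
            else:
                cols.append(("B-" if i == 0 else "I-") + category)
            rows.append("\t".join(cols))
    return rows


def spans(labels: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, category) term spans from B-/I-/O labels."""
    found, start, cat = set(), None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        inside = lab.startswith("I-") and cat == lab[2:]
        if start is not None and not inside:
            found.add((start, i, cat))
            start, cat = None, None
        if lab.startswith("B-"):
            start, cat = i, lab[2:]
    return found


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Term-level precision, recall, and F value (exact span and category match)."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f
```

Calling `to_rows(EXAMPLE, False, False)` corresponds to the first group (character + category labels) and `to_rows(EXAMPLE, True, True)` to the fourth. Rows written to a file in this layout could then be passed to CRF++'s `crf_learn` together with a feature template, and the resulting model applied with `crf_test -m`; the template itself is not reproduced in the abstract and is not guessed at here.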
Keywords/Search Tags: Treatise on Cold Damage, Terminology identification, Conditional random field, Terminology of Chinese medicine