| Traditional Chinese Medicine (TCM) medical records are an important carrier for the inheritance and development of TCM; they record information such as the diagnosis of patients’ diseases and the rules of TCM. The text that records this information is called TCM text. Mining and using the effective information contained in TCM text is of great significance for follow-up research and for promoting the development of TCM. To mine this information efficiently, researchers need to process TCM text with natural language processing techniques. Word segmentation is a key step in this process, and the accuracy of its results affects subsequent experiments. Segmentation ambiguity is the main factor that reduces the accuracy of word segmentation in TCM text. This thesis constructs a TCM text segmentation model and a TCM text multi-feature ambiguity resolution model in order to resolve segmentation ambiguity in TCM text and improve segmentation precision. The main work of this thesis is as follows: (1) First, TCM medical records are collected and sorted. A total of 20,000 medical records collected from the Second Affiliated Hospital of Shandong University of Traditional Chinese Medicine from 2005 to 2020 are selected as the dataset, and the content of these records is summarized and analyzed. Second, the TCM medicine data and TCM symptom terms in the TCM text are standardized. Third, the causes and classification of ambiguous fields are analyzed, and the difficulties in resolving them are summarized. Finally, the multiple features used for ambiguity resolution are selected by analyzing the characteristics of TCM text. (2) A TCM text segmentation model based on Bi-GRU is proposed. First, the TCM text is annotated with the four-tag BMES scheme (B represents the first character of a word, M represents a middle character, E
represents the last character, and S represents a single-character word). After annotation, the text is vectorized with the Word2vec method to obtain text vectors. Second, the text vectors are fed into the Bi-GRU neural network, which captures information in both the forward and backward directions and outputs the possible labels of each character. Finally, the Viterbi algorithm selects the label sequence with the highest probability as the final word segmentation result. (3) A multi-feature ambiguity resolution model for TCM text is proposed. Targeting the combinational type of segmentation ambiguity, a TF-IDF algorithm extended with word length is used to compute the weight feature, and the context word features and part-of-speech features within the text window containing the ambiguous field are extracted according to the characteristics of TCM text, which is concise, fuzzy, and unstructured. The weight feature, context word feature, and part-of-speech feature are combined into a multi-feature vector, which is input into a nonlinear support vector machine to construct a "combination" classifier and a "division" classifier that produce the segmentation results for ambiguous fields. Comparative experiments are carried out to verify the performance of the TCM text segmentation model and the TCM text multi-feature ambiguity resolution model designed in this thesis. The experimental results show that the accuracy of the word segmentation method reaches 93.26% and that the accuracy after ambiguity resolution reaches 94.75%, which indicates that the methods proposed in this thesis are feasible and effective. |
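
The BMES annotation and Viterbi decoding steps described in (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`words_to_bmes`, `viterbi_bmes`, `tags_to_words`) are hypothetical, the per-character tag probabilities are assumed to come from some tagger such as a Bi-GRU, and transitions are treated as hard BMES validity constraints with no learned transition probabilities, which is a simplification made here for illustration.

```python
import math

TAGS = ["B", "M", "E", "S"]

# Transitions permitted by the BMES scheme (assumed hard constraints):
# a word opened with B must continue with M or close with E;
# after E or S a new word starts with B or S.
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}


def words_to_bmes(words):
    """Convert a list of segmented words into one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def viterbi_bmes(emissions):
    """Return the highest-probability valid BMES tag sequence.

    emissions: one dict per character mapping each tag to a probability,
    standing in for the per-character output of a tagger.
    """
    if not emissions:
        return []

    def log(p):
        return math.log(p) if p > 0 else -math.inf

    # A sentence can only start with B or S.
    score = {
        t: log(emissions[0][t]) if t in ("B", "S") else -math.inf
        for t in TAGS
    }
    backptrs = []
    for probs in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in TAGS:
            # Best previous tag among those allowed to precede `cur`.
            best_prev = max(
                (p for p in TAGS if cur in ALLOWED[p]),
                key=lambda p: score[p],
            )
            new_score[cur] = score[best_prev] + log(probs[cur])
            ptr[cur] = best_prev
        score = new_score
        backptrs.append(ptr)

    # A sentence can only end with E or S; then follow back-pointers.
    path = [max(("E", "S"), key=lambda t: score[t])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag to recover the words."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):
            words.append(text[start:i + 1])
            start = i + 1
    return words
```

Choosing the most probable tag independently for each character can yield an invalid sequence such as B, M, S. Decoding over the whole sequence avoids this, which is why a Viterbi step is placed after the tagger.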