The ancient books and prescriptions of Traditional Chinese Medicine (TCM) record experience in syndrome differentiation that has been passed down for thousands of years. Natural Language Processing (NLP) techniques can be used to analyze these books and prescriptions, assisting doctors in diagnosis and decision-making and promoting the intelligent development of TCM. Herb recommendation is an important task that helps achieve intelligent diagnosis, while TCM entity recognition is the basic task that constructs data for herb recommendation by identifying entities such as herbs and symptoms. At present, however, the ancient TCM books are poorly digitized, and it is difficult to obtain a large-scale domain corpus. Moreover, most ancient TCM books are written in ancient Chinese, which is hard to segment. How to make effective use of the semantic information in word vectors to improve TCM entity recognition therefore deserves study. In addition, herb recommendation in TCM relies on syndrome differentiation rather than on directly matching symptoms to herbs, and herb compatibility must also be taken into account. How to design a neural network structure that incorporates TCM domain knowledge to improve performance is likewise an open question. To address these problems, this thesis was carried out in cooperation with TCM experts from Capital Medical University. Building on a newly constructed TCM corpus, entity recognition algorithms based on semi-supervised learning and word fusion are designed to provide data support for herb recommendation. A label-enhanced herb recommendation model is then designed to assist intelligent diagnosis in TCM. The main research content and contributions are summarized as follows:

(1) A total of 376 TCM books are proofread to construct a large-scale unlabeled corpus. An open-source segmentation tool is then used to segment the books, and Word2vec is used to train word vectors for research on related tasks in the TCM field.

(2) A semi-supervised entity recognition algorithm for TCM books is proposed. It is based on the conditional random field. In addition to supervised features such as parts of speech and dictionaries, an unsupervised semantic feature based on vector similarity is designed. The experimental results show that this algorithm improves performance by exploiting the information in the large-scale unlabeled corpus, and the F1-score reaches 73.94%. To further reduce the manual labeling load and to make reasonable use of the TCM word vector information, a TCM entity recognition algorithm based on word fusion is also proposed. The algorithm uses word frequency to design a weight calculation method that integrates the vector representations of the words each character belongs to under different word segmentations, so that the TCM word vector information can be fully exploited even when word segmentation is inaccurate. The experimental results show that this algorithm further improves entity recognition performance, and the F1-score reaches 74.38%. Furthermore, adding TCM linguistic rules raises the F1-score to 83.18%, which is better than the other methods.

(3) A label-enhanced herb recommendation model is proposed. The model mimics syndrome differentiation in the TCM diagnostic process through a syndrome differentiation module based on the self-attention mechanism. In addition, a graph neural network and an attention mechanism module are designed to capture the compatibility of herbs and the relationship between symptoms and herbs. Label enhancement is used to form a simulated herb distribution that serves as a new training target for the model. The experimental results show that the proposed model significantly improves herb recommendation performance. The final p@5 and r@5 reach 33.47% and 24.59%, respectively.
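For readers unfamiliar with the p@5 and r@5 figures reported for herb recommendation, the following is a minimal sketch of how precision@k and recall@k are commonly computed for a single prescription. It assumes the standard definitions (hits among the top k divided by k, and hits divided by the number of true herbs); the abstract does not state how the thesis averages these values over the test set. The function names and the herb lists are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Fraction of the top-k recommended herbs that appear in the true prescription."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for herb in ranked[:k] if herb in truth)
    return hits / k


def recall_at_k(ranked: Sequence[str], truth: Set[str], k: int) -> float:
    """Fraction of the true prescription's herbs that appear in the top-k list."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not truth:
        return 0.0  # assumption: an empty ground truth contributes zero recall
    hits = sum(1 for herb in ranked[:k] if herb in truth)
    return hits / len(truth)


if __name__ == "__main__":
    # Hypothetical model output (best first) and hypothetical ground truth.
    ranked_herbs = ["gancao", "renshen", "fuling", "baizhu", "chenpi", "danggui"]
    true_herbs = {"gancao", "fuling", "danggui", "huangqi"}
    print(precision_at_k(ranked_herbs, true_herbs, 5))  # 2 hits / 5 = 0.4
    print(recall_at_k(ranked_herbs, true_herbs, 5))     # 2 hits / 4 = 0.5
```

Dataset-level scores are typically obtained by averaging these per-prescription values, though other aggregation schemes exist.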