The distinctive feature of acupuncture treatment in Chinese medicine lies in the complex combinations among acupoints. Classifying acupoints according to the correspondence between acupoints and disease patterns (syndromes), and exploring which acupoints are dominant in treating particular disease patterns and the rules by which they are combined, has been an active research topic in recent years. At the same time, advances in computer technology have driven the informatization of the medical field, and textual data such as Chinese medical literature, Chinese medicine formulas, and acupoint records have become essential objects of data mining and research. The ancient literature of acupuncture and moxibustion contains rich data on acupoints and disease patterns. Mining the latent knowledge relationships in these texts, using category classification to improve the accuracy of acupoint selection in treatment, and providing the best acupoint prescriptions for disease patterns all have high practical value in clinical applications; however, the redundancy and irregularity of the literature reduce classification accuracy. This paper combines big data and natural language processing techniques: it first normalizes the Chinese medicine data and then classifies the acupuncture data with a BERT-based model, in order to explore the association between acupoints and disease patterns more deeply through the similarity and relevance of acupuncture text statements and to capture the subtle differences between categories.

The main work of this paper is as follows:

(1) Construction of the acupoint database. First, data were collected from ancient texts dated before October 1, 1949, and from modern literature dated after October 1, 1949, using ancient-text and literature search platforms. Then, according to the research needs, inclusion and
exclusion criteria were developed, and the collected documents were screened and sorted. Finally, the document-type data from the ancient texts were stored in a MongoDB database, and the manually annotated and collated structured data were stored in a SQL Server database. The text data are stored separately by language type, classical (literary) Chinese and modern Chinese: an ancient-books collection and a modern-documents collection are created in MongoDB, and an ancient-books table and a modern-documents table are created in SQL Server. Combining the advantages of the two databases makes it easy to browse and retrieve document information, to query related documents by keyword, and to store massive data, providing a usable corpus for natural language processing and data mining in future research.

(2) Construction of the Bert-Chinese-Acupoint language model. The Bert-Base-Chinese language model was improved, and the Bert-Chinese-Acupoint model was constructed by re-training it on the acupoint database. The corpus sentences were de-duplicated and prepared in the text format required by the model, and the frequency of each word was calculated to form a domain-specific corpus dictionary; pre-training was carried out on Google Cloud Platform. The model was then fine-tuned on the acupoint training data to strengthen its domain specificity and improve the semantic representation of specialized acupoint terms in Chinese medicine. Experiments on the learning rate and the number of training epochs were conducted to find parameters better suited to the acupoint classification task. The final results were compared with those of support vector machine, naive Bayes, and long short-term memory (LSTM) network classifiers, and showed that the Bert-Chinese-Acupoint model constructed in this paper was able to achieve an
accuracy of 97% in acupoint classification. With this high classification accuracy, the Bert-Chinese-Acupoint model can be used to predict the acupoint category that best matches a disease and, by combining the relationships between acupoint categories, to discover the best acupoint groupings, which can be applied to acupoint recommendation in acupuncture.

The innovations of this paper are as follows:

(1) MongoDB and SQL Server are combined to build a document-based acupoint database and a structured acupoint database for acupuncture. On the one hand, this provides corpus support for the pre-training phase of the Bert-Chinese-Acupoint model. On the other hand, it standardizes and systematizes the information in Chinese acupoint documents and provides the basis for subsequent mining and analysis of acupoint patterns.

(2) Based on the Bert-Base-Chinese model, the Bert-Chinese-Acupoint model was obtained by re-training on the collected acupuncture acupoint texts, drawing on the characteristics of the five acupoints corpus; the pre-training process was completed on Google Cloud Platform over about two weeks. The classification prediction of acupoint categories reached an accuracy of 97%, which is 3 percentage points higher than that of the comparison benchmark model.
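
The corpus preparation step described in (2) of the main work (de-duplicating sentences and counting term frequencies to form a corpus dictionary) might look roughly like the following minimal sketch. It is illustrative only and is not the paper's actual pipeline. It assumes character-level units, as is common for Chinese BERT vocabularies, whereas the paper speaks of word frequency without specifying the tokenization. The function name `build_corpus_dictionary`, the `min_count` parameter, and the sample sentences are hypothetical and are not taken from the acupoint database.

```python
from collections import Counter


def build_corpus_dictionary(sentences, min_count=1):
    """De-duplicate sentences, then count character frequencies.

    Assumption: characters are used as the counting unit; the paper
    does not state which tokenization it applied.
    """
    # dict.fromkeys keeps the first occurrence of each sentence and
    # preserves order, which drops exact duplicates.
    cleaned = (s.strip() for s in sentences)
    unique = list(dict.fromkeys(s for s in cleaned if s))

    # Count every non-whitespace character across the unique sentences.
    counts = Counter(ch for s in unique for ch in s if not ch.isspace())

    # Order the dictionary by descending frequency, breaking ties by
    # character so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [ch for ch, n in ranked if n >= min_count]
    return unique, counts, vocab


# Hypothetical sample sentences, for illustration only.
sentences = ["合谷主治头痛", "合谷主治头痛", "足三里主治胃痛"]
unique, counts, vocab = build_corpus_dictionary(sentences)
```

Here `unique` holds the de-duplicated sentences, `counts` maps each character to its frequency, and `vocab` lists the characters from most to least frequent; raising `min_count` would drop rare characters from the dictionary.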