| Research on sound and meaning in Chinese requires bringing together materials that share phonetic-semantic relationships in order to achieve form-phonetic-semantic interpolation. However, interpretive information is gained and lost as documents are transmitted, through differences in exegetical methods, synonym substitution, and word usage, which makes the manual association of similar interpretive texts inefficient and inaccurate. A deeper text similarity calculation method is therefore needed. Text similarity calculation has been widely used in ancient Chinese search engines and precise document recommendation. However, existing similarity algorithms face limitations such as feature dimensionality and redundant features when applied to the ancient Chinese annotated corpus, resulting in less-than-ideal clustering results. To solve these problems, this study proposes a similarity calculation and text clustering method for ancient Chinese annotated texts. The method helps researchers quickly identify possible differences in phonetic-semantic matching and supports the work of form-phonetic-semantic interpolation. We take the interpretive texts in the ancient Chinese annotated corpus as the research object, perform automatic word segmentation and similarity calculation on the corpus with the help of pre-trained language models, and cluster and associate similar texts. The work addresses several aspects: the corpus, the construction of the word segmentation model, and the similarity calculation method. First, we constructed a database of the ancient Chinese annotated corpus, annotated the texts by field, and extracted their core fields as the experimental corpus for similarity calculation and text clustering. Second, because many current automatic word segmentation methods perform poorly on the ancient Chinese annotated corpus, we improved the accuracy of manual word segmentation annotation by establishing segmentation specifications, and fine-tuned a BERT-based pre-trained model to fully integrate the textual features of the annotated corpus, yielding Cishu BERT, an automatic word segmentation model for the domain of ancient Chinese annotated corpora. Third, building on the fine-tuned Cishu BERT model, we trained it further on manually annotated similar texts to improve its feature learning ability, and realized similarity calculation and text clustering for the ancient Chinese annotated corpus. Finally, the texts are clustered by similarity, and a knowledge graph of the ancient Chinese annotated corpus is constructed to represent phonetic-semantic relationships, helping researchers conveniently discover potential phonetic-semantic relationships. |