| At present, most studies focus on retrieving a single formula or a single passage of text from the literature. Few studies take the entire document as the unit when matching the similarity between scientific and technological documents, because doing so requires a comprehensive analysis of the similarity of multiple formulas and texts. The difficulty of calculating the similarity between scientific and technological documents lies in integrating the mathematical formula features and the text features of the documents reasonably and effectively. Therefore, this thesis examines how formulas and their surrounding text influence document similarity, and proposes a document similarity analysis model based on formula comparison. Building on this model, two fusion models that incorporate the text surrounding the formulas are constructed to measure document similarity. The work of this thesis is summarized as follows: (1) Because of the particular structure of formulas, this thesis uses a word embedding model to explore the document similarity analysis model based on formula comparison. Formulas are vectorized with Tangent-CFT to obtain their feature vectors. In addition, this thesis extracts the formula position, the frequency of formula appearance, the number of formulas in the full text, and the length of the context before and after the formula to calculate an importance score for each formula. The mixed formula comparison experiments consider three factors: the weight distribution of formula importance, the matching principle for formula similarity, and the selection range of formulas. Document similarity is analyzed under different combinations of these schemes. The experimental results show that assigning 0/1 importance weights to a fixed number of formulas in the document and matching formulas by the best global similarity method yields the best results for similar document retrieval tasks. (2) Based on the best scheme from the above experiments, the text around the formula is introduced to propose two document similarity analysis models based on the mixed comparison of formulas and text. The two models combine the formula with the long text around it and with the surrounding keywords, respectively. For long texts, a keyword extraction algorithm is used to obtain the keyword information, and a word embedding model is used to obtain feature vectors for both kinds of text. This thesis also considers the influence of the text extraction length, and compares concatenating the formula vector and the text vector with taking their weighted combination. Document similarity is then calculated from the similarity of the "formula-context" pairs in the documents. The experimental results show that when the formula vector and the keyword vector are weighted in a ratio of 4:6, the mixed vector obtains the best results in the document K-means clustering and KNN classification experiments. (3) To evaluate the above methods, we extract scientific and technological documents in the Computer Science (CS) category from the arXMLiv dataset and obtain a labeled document dataset according to the classification on the arXiv.org website. The experimental results verify the effectiveness of using formula similarity to measure document similarity. Moreover, the hybrid comparison method combined with text outperforms pure formula comparison, which shows that fusing the contextual text of formulas can supplement their mathematical semantics and improve document similarity analysis. |
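
The weighted fusion of formula and keyword vectors described in (2) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`fuse`, `document_similarity`), the L2 normalization before weighting, the requirement that both vectors share one dimension, and the "average of best matches" document score are all assumptions made for illustration. In particular, the best-match average is only a simple stand-in and is not claimed to be the "best global similarity" matching rule used in the experiments. Only the 4:6 weighting ratio is taken from the abstract.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumption: done before weighting)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def fuse(formula_vec, keyword_vec, w_formula=0.4, w_keyword=0.6):
    """Weighted combination of a formula vector and a keyword vector.

    The default 4:6 ratio follows the abstract; everything else here
    is a hypothetical choice.
    """
    if len(formula_vec) != len(keyword_vec):
        raise ValueError("weighted fusion assumes vectors of equal dimension")
    f = l2_normalize(formula_vec)
    k = l2_normalize(keyword_vec)
    return [w_formula * a + w_keyword * b for a, b in zip(f, k)]


def document_similarity(pairs_a, pairs_b):
    """Score two documents given as lists of (formula_vec, keyword_vec) pairs.

    Stand-in matching rule (assumption): each fused pair in document A is
    matched to its most similar fused pair in document B, and the best
    scores are averaged.
    """
    fused_a = [fuse(f, k) for f, k in pairs_a]
    fused_b = [fuse(f, k) for f, k in pairs_b]
    if not fused_a or not fused_b:
        return 0.0
    best = [max(cosine(a, b) for b in fused_b) for a in fused_a]
    return sum(best) / len(best)


# Toy 2-dimensional vectors; real formula and word embeddings would be
# much higher-dimensional.
doc_a = [([1.0, 0.0], [0.0, 1.0]), ([0.5, 0.5], [1.0, 0.0])]
doc_b = [([1.0, 0.1], [0.1, 1.0])]
score = document_similarity(doc_a, doc_b)
```

Note that this score is not symmetric in its two arguments, because each pair in the first document searches for its best match in the second. A symmetric variant could average the scores computed in both directions.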