Research On The Similarity Model Of Scientific And Technological Documents With Mathematical Formulas And Contexts

Posted on:2024-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y Q Shen

GTID:2530307052495934

Subject:Electronic information

Abstract/Summary:

At present, most studies focus on retrieval methods for a single formula or a single text passage in the literature. Few studies treat the entire document as the unit for matching the similarity between scientific and technological documents, because doing so requires a comprehensive analysis of the similarity of multiple formulas and texts. The difficulty of calculating the similarity between scientific and technological documents lies in integrating the mathematical formula features and the text features of the documents reasonably and effectively. Therefore, this thesis examines how formulas and their surrounding text influence document similarity, and integrates formulas and text to propose a document similarity analysis model based on formula comparison. Building on this model, two fusion models that incorporate the text surrounding the formulas are constructed to measure document similarity. The work of this thesis is summarized as follows:

(1) Because of the particular structure of formulas, this thesis uses a word embedding model to explore a document similarity analysis model based on formula comparison. Formulas are vectorized with Tangent-CFT to obtain their feature vectors. In addition, the thesis extracts the formula position, the frequency with which the formula appears, the number of formulas in the full text, and the length of the context before and after the formula to calculate an importance score for each formula. The formula mixed-comparison experiments consider three factors: the weight distribution of formula importance, the matching principle for formula similarity, and the selection range of formulas. Document similarity is analyzed by combining different schemes. The experimental results show that assigning importance weights of 0 and 1 to a fixed number of formulas in the document, and matching formulas with the best global similarity method, gives the best results on the similar-document retrieval task.

(2) Based on the best scheme obtained from the above experiments, the text information around the formula is introduced to propose two document similarity analysis models based on the mixed comparison of formulas and text. The two models combine the formula with, respectively, the long text around it and the surrounding keywords. For long texts, a keyword extraction algorithm is used to obtain keyword information, and a word embedding model is used to obtain feature vectors for both kinds of text. The thesis also considers the influence of the text extraction length, and compares stitching (concatenating) the formula vector and the text vector with weighting the two vectors. Document similarity is calculated from the similarity of "formula-context" pairs in the documents. The experimental results show that when the formula vector and the keyword vector are weighted in a ratio of 4:6, the mixed vector obtains the best results in the document K-means clustering and KNN classification experiments.

(3) To evaluate the above methods, scientific and technological documents in the Computer Science (CS) category are extracted from the arXMLiv dataset, and a labeled document dataset is obtained according to the classification on the arXiv.org website. The experimental results verify the effectiveness of using formula similarity to measure document similarity. Moreover, the hybrid comparison method combined with text outperforms pure formula comparison, which shows that fusing the contextual text of a formula can supplement the mathematical semantics of the formula and improve document similarity analysis.
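
The 0/1 importance weighting, the 4:6 weighted fusion, and the comparison of "formula-context" pairs described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the thesis's implementation: the function names (`cosine`, `unit`, `fuse`, `select_top_k`, `document_similarity`) are hypothetical, the fusion is taken to be a weighted sum of unit-normalised vectors of equal dimension, the matching rule is a simple symmetric best-match average that may differ from the "best global similarity" method, and the toy 2-dimensional vectors stand in for real Tangent-CFT and keyword embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def unit(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def fuse(formula_vec, keyword_vec, w_formula=0.4, w_keyword=0.6):
    """Weighted sum of the unit-normalised vectors (assumes equal dimensions)."""
    f, k = unit(formula_vec), unit(keyword_vec)
    return [w_formula * a + w_keyword * b for a, b in zip(f, k)]


def select_top_k(pairs, importance, k):
    """0/1 weighting: keep the k pairs with the highest importance score."""
    ranked = sorted(range(len(pairs)), key=lambda i: importance[i], reverse=True)
    return [pairs[i] for i in sorted(ranked[:k])]


def document_similarity(doc_a, doc_b):
    """Symmetric mean of each pair's best cosine match in the other document."""
    if not doc_a or not doc_b:
        return 0.0
    a_to_b = sum(max(cosine(p, q) for q in doc_b) for p in doc_a) / len(doc_a)
    b_to_a = sum(max(cosine(q, p) for p in doc_a) for q in doc_b) / len(doc_b)
    return (a_to_b + b_to_a) / 2


# Toy 2-dimensional vectors standing in for real embeddings (illustrative only).
doc_a = [fuse([1.0, 0.0], [0.0, 1.0]), fuse([1.0, 1.0], [1.0, 0.0])]
doc_b = [fuse([1.0, 0.0], [0.0, 1.0])]
doc_a_top = select_top_k(doc_a, importance=[0.9, 0.2], k=1)
print(document_similarity(doc_a_top, doc_b))
```

Normalising both vectors before weighting keeps the 4:6 ratio meaningful when the two embeddings have different scales. Concatenation, the alternative the abstract compares against, would not require equal dimensions.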

Keywords/Search Tags:

Technical Document Similarity, Formula Similarity, Word Embedding, Feature Vector, Mathematical Formula Embedding

Related items

1	Key Technology And Application Of Understanding Elementary Mathematical Problem Based On Word Embedding
2	Research On Feature-based Large-scale Graph Similarity
3	Study On Link Prediction Based On Network Embedding And Transfer Similarity Methods
4	Prediction Of Type Ⅲ Secreted Effectors Based On Word Embedding And Deep Learning
5	The Binning Of Metagenomic Sequence Based On Statistical Model And Word Embedding
6	Research On The Method And System Of Transform Mathematical Formula To Braille
7	Research On Spatial Similarity Calculating Method Between GML Documents
8	Identification Of Protein-protein Interaction Based On Relational Similarity Of The Text
9	Identification Of Protein-protein Interaction Based On The Constraint Of Semantic Similarity On Context
10	A Novel Generative Topic Embedding Model By Introducing Network Communities