
Word Sense Disambiguation And Multidimensional Relationship Mining In Science And Technology

Posted on: 2024-06-13
Degree: Master
Type: Thesis
Country: China
Candidate: H Q Liu
Full Text: PDF
GTID: 2568307151953369
Subject: Computer Science and Technology
Abstract/Summary:
As society advances, the state attaches growing importance to the development of science and technology, and science and technology data are growing in type, volume and form. The data used in this study were accumulated over a long period from the various business systems of the Hebei Provincial Institute of Science and Technology Information and comprise institutional data, results data, personnel data and special data. They are scattered, duplicated, incomplete and weakly related, which makes it difficult to form an overall service advantage. How to transform such complicated, irregular data into valuable, analysable data is the problem to be solved. Taking science and technology data as the research object, this study processes the cluttered data into related, valuable data through short text disambiguation and multidimensional relationship mining. The main research of the thesis is as follows:

(1) Short text disambiguation in science and technology. Because science and technology data originate from different business platforms, they contain many homonyms, synonyms and near-synonyms. To unify and standardise these data, this study selects identical or similar texts in the science and technology data for disambiguation. The texts come from the multi-source data fused in the Hebei Science and Technology Innovation Data Comprehensive Service Platform. Unlike long texts, which have rich contextual semantics and large-scale manual annotation, they are structured Chinese short texts with severely limited contextual features and no labelled data. The lack of rich contextual semantics and the lack of labelled data are the two problems of the current short text disambiguation task. To address them, this study proposes a self-supervised hierarchical heterogeneous graph convolutional disambiguation model. First, the multi-field structural information of the data is used to construct a short text graph, which introduces more semantic and syntactic information to describe the interactions between words, lexical labels, dimensions and entities; this enriches the embedded feature representation of the data and improves disambiguation accuracy. Second, a self-supervised approach disambiguates short texts without relying on labelled text pairs, which solves the problem of missing labelled data. The model was compared on the same standard validation set with semi-supervised disambiguation models such as HGAT and HHGC. The experimental results show that the self-supervised model achieves results comparable to those of current strong semi-supervised methods.

(2) Multidimensional relationship mining for science and technology data. Based on the standard index data obtained by disambiguating the Chinese short texts, associations are established between data from different sources to mine the multidimensional relationships of science and technology data. This study proposes a multimodal relationship mining model that treats the unstructured and structured datasets as two modalities and combines them to mine multidimensional association relationships. Traditional analysis classifies by comparing similarity on explicit feature information alone, ignoring standard dimensional relationships and implicit dimensional relationships that already exist. As a result, existing science and technology data mining methods extract only single relation triples and do not go on to mine the multidimensional associations between triples. To solve these problems, this study combines the standardised specification of structured data with the detailed description and logical relationships of unstructured data, using the advantages of both to build co-constructed information and realise the value of science and technology data. First, the structured and unstructured data are pre-processed. Second, the pre-processed data go through feature representation, modal fusion and other feature extraction steps. Finally, a Bi-GRU network is used for association relationship mining. In addition, because science and technology data span multiple domains and systems and grow every year, a distributed learning model is introduced: data parallelism shortens training time by increasing the number of devices and achieves a nearly linear speedup. The experiments use standard science and technology data from various official websites as validation sets. Compared with the traditional model, the proposed method scores higher on the different standard validation sets, which demonstrates its effectiveness.
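The short text graph described in part (1) can be illustrated with a minimal sketch. It is not the thesis model, only one plausible way to link words, lexical labels, dimensions and entities taken from a multi-field record. The function name `build_short_text_graph`, the field names, the sample tokens, and the reading of "lexical labels" as part-of-speech tags are all assumptions made for illustration. Tokens are assumed to arrive already segmented and tagged by an upstream tool.

```python
from collections import defaultdict


def build_short_text_graph(record_id, fields):
    """Build an undirected heterogeneous graph for one structured record.

    fields maps a field (dimension) name to a list of (word, label) pairs.
    Nodes are (node_type, value) tuples; the node types used here
    ("entity", "dimension", "word", "label") are illustrative assumptions.
    """
    edges = defaultdict(set)

    def link(a, b):
        # Store each edge in both directions so the graph is undirected.
        edges[a].add(b)
        edges[b].add(a)

    entity = ("entity", record_id)
    for field_name, tokens in fields.items():
        dimension = ("dimension", field_name)
        link(entity, dimension)          # entity - dimension
        previous = None
        for word, label in tokens:
            word_node = ("word", word)
            link(dimension, word_node)   # dimension - word
            link(word_node, ("label", label))  # word - lexical label
            if previous is not None:
                link(previous, word_node)  # adjacent words in the same field
            previous = word_node
    return edges


# Hypothetical record with two fields; values are made up for illustration.
record = {
    "institution": [("Hebei", "NR"), ("University", "NN")],
    "topic": [("graph", "NN"), ("disambiguation", "NN")],
}
graph = build_short_text_graph("rec-1", record)
```

A graph convolutional model would then learn node embeddings over this adjacency structure. How the thesis weights or layers the different edge types is not stated in the abstract, so the sketch treats all edges alike.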
Keywords/Search Tags:science and technology data, intelligent systems, self-supervision, multimodality, disambiguation, multidimensional relationship mining