
Research On RDF Subgraph Matching Method In Distributed Environment

Posted on: 2021-01-20
Degree: Master
Type: Thesis
Country: China
Candidate: W K Xing
Full Text: PDF
GTID: 2370330602989121
Subject: Computer Science and Technology
Abstract/Summary:
With the rise of knowledge graphs, more and more data sets are published and maintained in the Resource Description Framework (RDF) format. Because RDF data has a natural graph structure, answering a SPARQL query over RDF data can be transformed into a subgraph matching problem on a large graph. As large volumes of RDF data are released, SPARQL query processing exceeds the capacity of a single machine, and distributed graph matching methods are attracting increasing attention. In distributed RDF query processing, the growing scale and structural complexity of query graphs mean that query optimization faces the dual challenges of query accuracy and performance. To address these problems, this thesis compares the advantages and disadvantages of the current mainstream query optimization schemes for distributed RDF graphs. After theoretical analysis and experimental verification, it proposes a structure-dominated distributed subgraph matching optimization method. The main research work is as follows.

First, the data is preprocessed. Using Jena2, the OWL ontology data set is extracted and processed to obtain RDF metadata in NT format, which is easy to process. Long metadata strings are compressed and stored as integer IDs by dictionary encoding. A summary statistics graph model is proposed, and the data required by the cost model proposed in this thesis is collected in advance using a type-based statistics scheme. Following the graph exploration strategy, a data graph partitioning method and an in-memory index structure for data storage are proposed: the compressed data is hash-partitioned by integer ID, each computing node stores its assigned data segment in an underlying key-value index structure, and virtual type/predicate nodes are built as an inverted index to speed up queries.

Then, the query graph is processed: its structure is decomposed and a query plan is generated. A CPM node decomposition model of the query graph is proposed, which exploits the matching characteristics of each part of the query graph structure in the distributed environment to speed up the query. A node-centric cost model is also proposed. It applies the idea of the minimum spanning tree to the weighted query structure graph computed from the summary statistics, which transforms the complex graph exploration problem into a query execution tree problem and yields an efficient query execution order.

Finally, the query plan is passed to each computing node, and the matching task is started on all computing nodes using the graph exploration pattern. An optimization strategy that delays Cartesian product operations is proposed to reduce the number of paths that carry full history information during core structure matching. A strategy that uses structure decomposition to divide the matching process is proposed so that path structure matching can be executed in parallel at high speed without redundancy, and the final matching result is obtained through a lightweight join on the host node.
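The preprocessing step described above (dictionary encoding followed by hash partitioning on integer IDs) can be illustrated with a minimal sketch. It is not the thesis implementation: the `Dictionary` class, the function names, the sample triples, and the choice to partition on the subject ID are all assumptions made for illustration, since the abstract does not say which ID is used as the partition key.

```python
from collections import defaultdict


class Dictionary:
    """Hypothetical dictionary encoder: maps RDF terms to dense integer IDs."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next free ID the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, term_id):
        return self.id_to_term[term_id]


def encode_triples(triples, dictionary):
    """Replace every subject, predicate, and object string with its ID."""
    return [tuple(dictionary.encode(t) for t in triple) for triple in triples]


def hash_partition(encoded_triples, num_nodes):
    """Assign each triple to a computing node.

    Assumption: the partition key is the subject ID, hashed by modulo.
    """
    partitions = defaultdict(list)
    for s, p, o in encoded_triples:
        partitions[s % num_nodes].append((s, p, o))
    return partitions


# Made-up sample data, not taken from the thesis.
triples = [
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:takes", "ex:course1"),
    ("ex:bob", "rdf:type", "ex:Student"),
]
dictionary = Dictionary()
encoded = encode_triples(triples, dictionary)
parts = hash_partition(encoded, num_nodes=2)
```

Partitioning on the subject keeps all outgoing edges of a vertex on one node, which suits graph exploration that expands from a matched vertex. Other keys are possible, and the thesis may use a different one.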
Keywords/Search Tags: Distribution, RDF Graph, Pattern Matching, Structure Decomposition, Type-Centric Statistics, Query Processing