Design And Implement An XML Retrieval Platform Based On Berkeley DB For XML Retrieval Research
Posted on: 2015-06-11 | Degree: Master | Type: Thesis | Country: China | Candidate: P Du | GTID: 2298330452459410 | Subject: Information management and information systems

Abstract/Summary:

With the growth of large-scale Internet data, traditional HTML information retrieval can no longer meet constantly changing information needs. As a standard for exchanging and representing Internet data, XML has many advantages, so it is becoming more and more important and will probably replace HTML. Mining useful information from large volumes of XML is therefore an urgent problem, and XML information retrieval will be a key technology for understanding and using XML information. This thesis designs and tests an XML information retrieval engine platform based on Berkeley DB (BDB), intended for XML retrieval research.

In this paper, we design and implement two XML retrieval algorithms based on different storage schemes, the XObject retrieval algorithm for multiple documents and the SLCAOffset retrieval algorithm, to provide algorithmic support for the platform.

In the XObject algorithm, we use the embedded Berkeley DB as the back-end database, which performed about twice as well as a traditional relational database. We then use the VTD-XML mechanism to parse XML documents and propose an XML structural clustering algorithm based on Trie-tree matching, whose clustering performance is better than that of popular clustering algorithms. In addition, we propose a tree parent storage schema to optimize the storage of cluster structural information. Finally, we dynamically construct the query path, execute XQuery on BDB XML, and store the returned results.

In the SLCAOffset algorithm, we design and implement a token record format based on BDB and a block storage schema for XML documents. We then introduce the offset concept to solve the problem that XML documents consume too much space. Furthermore, we obtain the object nodes to return by implementing a node aggregation algorithm. Finally, we read the result segments from the document blocks using the recorded offsets and sort the segments.

In the result analysis, we conduct an experiment on real data and present the detailed test process and results. The results show that the proposed clustering algorithm has excellent recall and precision. We then test the tree parent storage schema on typical datasets, and the results show that our scheme saves about 60%-70% of storage space compared with path storage. Finally, we derive a formula for the block size by analyzing the distribution of XML segment sizes.

As future work, we propose a spider application and a novel distributed architecture based on BDB.

Keywords/Search Tags: Berkeley DB, XML structural cluster, XObject, SLCAOffset
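The abstract contrasts a tree parent storage schema with path storage but does not give the record layout. The following is a minimal sketch of the general idea only, assuming each distinct structural node is stored once as a (tag, parent id) row in a trie over root-to-node paths instead of as a full path string. The class name `ParentTable`, the `ROOT` sentinel, and the sample paths are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple


class ParentTable:
    """Trie over root-to-node tag paths, stored as (tag, parent_id) rows.

    Hypothetical illustration; the thesis's actual record format is not
    described in the abstract.
    """

    ROOT = -1  # assumed sentinel: parent id of a top-level tag

    def __init__(self) -> None:
        self.rows: List[Tuple[str, int]] = []         # node id -> (tag, parent id)
        self._index: Dict[Tuple[int, str], int] = {}  # (parent id, tag) -> node id

    def insert(self, path: str) -> int:
        """Insert a path such as '/dblp/article/title'; return its last node id."""
        parent = self.ROOT
        for tag in path.strip("/").split("/"):
            key = (parent, tag)
            if key not in self._index:
                # Shared prefixes are stored only once.
                self._index[key] = len(self.rows)
                self.rows.append((tag, parent))
            parent = self._index[key]
        return parent

    def path_of(self, node_id: int) -> str:
        """Rebuild the full path by following parent ids up to the root."""
        tags: List[str] = []
        while node_id != self.ROOT:
            tag, node_id = self.rows[node_id]
            tags.append(tag)
        return "/" + "/".join(reversed(tags))


# Made-up sample paths, used only to exercise the sketch.
paths = [
    "/dblp/article/title",
    "/dblp/article/author",
    "/dblp/article/year",
    "/dblp/inproceedings/title",
    "/dblp/inproceedings/author",
]
table = ParentTable()
ids = [table.insert(p) for p in paths]
```

Here the five sample paths share prefixes, so the table holds eight rows rather than five full path strings, and `table.path_of(node_id)` recovers any path on demand. How much space this saves in practice depends on the dataset and on the on-disk encoding, which this sketch does not model.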