Design And Implement An XML Retrieval Platform Based On Berkeley DB For XML Retrieval Research

Posted on:2015-06-11

Degree:Master

Type:Thesis

Country:China

Candidate:P Du

Full Text:PDF

GTID:2298330452459410

Subject:Information management and information systems

Abstract/Summary:

With the growth of large-scale Internet data, traditional HTML information retrieval cannot meet constantly changing information needs. As a standard for exchanging and representing Internet data, XML has many advantages, so it is becoming increasingly important and will probably replace HTML. Mining useful information from large numbers of XML documents is therefore an urgent problem, and XML information retrieval will be a key technology for understanding and using XML information. This thesis designs and tests an XML information retrieval engine platform based on Berkeley DB (BDB), intended for XML retrieval research.

We design and implement two XML retrieval algorithms based on different storage schemes, the XObject retrieval algorithm for multiple documents and the SLCAOffset retrieval algorithm, to provide algorithmic support for the platform.

In the XObject algorithm, we use embedded Berkeley DB as the back-end database, which performed about twice as well as a traditional relational database. We then use the VTD-XML mechanism to parse XML documents and propose an XML structural clustering algorithm based on Trie-tree matching, whose clustering performance is better than that of popular clustering algorithms. In addition, we propose a tree parent storage schema to optimize the storage of cluster structural information. Finally, we dynamically construct the query path, execute XQuery on BDB XML, and store the returned results.

In the SLCAOffset algorithm, we design and implement a BDB-based token record format and a block storage schema for XML documents. We then introduce the concept of an offset to solve the problem that XML documents consume too much space. Furthermore, we obtain the returned object nodes by implementing a node aggregation algorithm. Finally, we read the result segments from the document blocks using the recorded offsets and sort the segments.

In the result analysis, we conduct an experiment on real data and present the detailed test process and results, which show that the proposed clustering algorithm performs very well on recall and precision. We then test the tree parent storage schema on typical datasets; the results show that it saves about 60%-70% of the storage space compared with path storage. Finally, we derive a formula for the block size by analyzing the distribution of XML segment sizes.

As future work, we propose a spider application and a novel distributed architecture based on BDB.
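
The abstract does not spell out the tree parent storage schema, so the following is only a minimal sketch of the general idea, assuming that each distinct path prefix is stored once as a (label, parent id) record instead of storing every full path string. The function names `build_parent_table` and `rebuild_path`, the virtual root id 0, and the sample paths are all hypothetical and not taken from the thesis; the sketch does not attempt to reproduce the reported 60%-70% saving.

```python
def build_parent_table(paths):
    """Store each distinct path prefix once as (label, parent_id).

    Hypothetical illustration: id 0 is a virtual root that has no record.
    """
    ids = {(): 0}   # prefix tuple -> node id
    table = {}      # node id -> (label, parent id)
    for path in paths:
        steps = tuple(step for step in path.split("/") if step)
        for depth in range(1, len(steps) + 1):
            prefix = steps[:depth]
            if prefix not in ids:
                ids[prefix] = len(ids)
                # The parent prefix was registered in an earlier iteration.
                table[ids[prefix]] = (prefix[-1], ids[prefix[:-1]])
    return table, ids


def rebuild_path(table, node_id):
    """Follow parent ids back to the virtual root to recover the full path."""
    labels = []
    while node_id != 0:
        label, node_id = table[node_id]
        labels.append(label)
    return "/" + "/".join(reversed(labels))


paths = [
    "/dblp/article/title",
    "/dblp/article/author",
    "/dblp/book/title",
]
table, ids = build_parent_table(paths)
leaf = ids[("dblp", "book", "title")]
full_path = rebuild_path(table, leaf)
```

Shared prefixes such as `/dblp` and `/dblp/article` are stored only once here, which is the intuition behind a parent-pointer layout being more compact than storing each path in full.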

Keywords/Search Tags:

Berkeley DB, XML structural cluster, XObject, SLCAOffset

Related items

1	Research On Berkeley DB For Cluster Server
2	Berkeley's immaterialism: An interpretation and critique
3	Berkeley's idealism: Arguments of the First Dialogue
4	Coleridge, Hartley, and Berkeley: Philosophy, religion, and politics, 1794--1796 (Samuel Taylor Coleridge, David Hartley, George Berkeley)
5	In defense of phenomenalism: Why Berkeley is not all wrong
6	A Design And Realization Of A Chinese Dictionary Based On B-tree And Berkeley Db
7	A Design And Realization Of A Chinese Dictionary Based On B-Tree And Berkeley DB
8	Design And Implementation Of Indexing Mechanism For Image Information Based On The Berkeley DB
9	Design And Implementation Of Indexing Mechanism For Image Information Based On The Berkeley Db
10	Integrity Designment Of Security And Reliability On Berkeley DB