Research On Key Techniques Of Distributed Data Processing And Storage

Posted on:2009-05-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L H Yu

Full Text:PDF

GTID:1118360242483023

Subject:Computer Science and Technology

Abstract/Summary:

With the development of information and communication technology, hardware and software have become very cheap, manpower has become the major part of the total cost of ownership (TCO), and the amount of data digitized and stored is growing quickly. Storage systems today face the following new research challenges and opportunities: rich metadata and effective data retrieval methods are essential; storage is better delivered as a service; and automatic storage systems are emphasized. Motivated by these challenges and opportunities, we propose a distributed data storage and management system based on inter-object relationships. This thesis focuses on its architecture, its fundamental subsystems, and its key techniques, including query and retrieval, distributed storage, and automatic optimization.

First, we present the architecture of the distributed data storage and management system. We introduce the semantic model based on inter-object relationships, together with the concepts of object and inter-object relationship. We then describe the query and retrieval language and query processing, and briefly introduce the system and subsystem architecture.

Based on peer-to-peer and object-based storage technology, we design a container-based distributed object store. In the container-based storage model, a container, which is the unit of data placement and replication, manages a set of storage objects and is responsible for low-level tasks such as block allocation. The container-based storage model reduces metadata size, simplifies system design, and hence improves storage system scalability. The system maintains runtime metadata using self-organizing peer-to-peer technology and handles server failure and addition transparently. It achieves reliability through primary/slave container replication with dynamic primary election, and achieves consistency through state-based object access and replica healing.

The distance index is an essential data structure for query processing in data management systems. However, the construction and query performance of existing index methods is far from satisfactory. We propose two indexes for directed graphs, DIX-C with constant query time and DIX-2HC with smaller index size, and also describe their undirected versions, UDIX-AP and UDIX-2HC. A related-query processing algorithm is then presented, based on these distance indexes and interval encoding. The experimental results show that our indexes outperform previous methods and that the related-query processing algorithm is very efficient.

Existing access correlation discovery methods often rely on support to prune the search space and hence cannot detect many valuable correlations with low support. Furthermore, they are not scalable enough to work in a distributed environment, and none of them can detect inter-server correlations, which are prevalent in distributed storage systems. Therefore, we propose three access correlation mining algorithms, HCM, VCM, and PFC-Miner, which employ correlation confidence as the primary interest measure. HCM is more efficient and VCM can run incrementally, but both work only on a stand-alone server. In contrast, PFC-Miner is a distributed approximate mining algorithm that is very scalable and able to discover inter-server correlations. Experimental results demonstrate the performance of the proposed algorithms and also show that the mined access correlations can be used to improve the cache hit ratio.

In storage systems, files usually have a large number of replicas and even more highly similar copies. While existing methods consider only identical replicas, we propose a keyword extraction method, PAKE, which detects files with high content similarity and then extracts keywords from their names. The experimental results show that PAKE improves retrieval significantly compared with previous methods.
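
The abstract's point that support-based pruning misses low-support correlations, while a confidence-based measure does not, can be illustrated with a small sketch. It is not HCM, VCM, or PFC-Miner, whose details are not given here. It assumes the common definition of confidence for an ordered pair (a, b): the fraction of accesses to a that are followed by b within a short window. The function name `correlation_confidence`, the `window` parameter, and the sample trace are all hypothetical.

```python
from collections import Counter


def correlation_confidence(trace, window=1):
    """Map each ordered pair (a, b) to count(a followed by b) / count(a).

    Assumed definition for illustration only; "followed by" means b appears
    among the next `window` accesses after an access to a.
    """
    occurrences = Counter(trace)   # how often each object is accessed
    pair_counts = Counter()        # how often b follows a within the window
    for i, a in enumerate(trace):
        # Count each distinct follower at most once per access to a.
        followers = set(trace[i + 1 : i + 1 + window])
        followers.discard(a)
        for b in followers:
            pair_counts[(a, b)] += 1
    return {pair: n / occurrences[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical access trace: "r" and "s" are accessed only once each.
trace = ["x", "y", "x", "z", "x", "y", "x", "z", "r", "s", "x", "y"]
conf = correlation_confidence(trace, window=1)

print(conf[("r", "s")])  # 1.0: seen once, but "s" always follows "r"
print(conf[("x", "y")])  # 0.6: seen three times, yet a weaker correlation
```

In this trace the pair (r, s) occurs only once, so a minimum support threshold of 2 would discard it, even though its confidence is higher than that of the frequent pair (x, y).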

Keywords/Search Tags:

distributed storage, semantic file system, personal information management, object-based storage, peer-to-peer, graph index, data mining, automatic storage system

Related items

1	Research On Global Persistent Object Storage System Based On Peer-to-Peer Network
2	Study On Data Management Technology Of Peer-to-Peer Distributed File System
3	Researching And Implementation Of A Distributed Storage System Based On Peer-to-Peer Architecture
4	Distributed Storage Research Of Spatial Data Over Peer-To-Peer Networks
5	Research On P2P Storage System Based On CAN
6	Research On Replication Management Technology In P2P File Storage System
7	Design And Implementation Of Distributed Storage System Based On Peer-to-peer
8	Research On Data Management In Peer-to-peer Storage System
9	Research And Implementation Of The Key Technologies In P2P Secure Storage System
10	P2P-based Network Storage Technology Research