Research On Key Techniques Of Distributed Data Processing And Storage

Posted on: 2009-05-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L H Yu
GTID: 1118360242483023
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information and communication technology, hardware and software costs have become negligible, manpower has become the major part of TCO (Total Cost of Ownership), and the amount of data digitized and stored is growing quickly. Storage systems today face the following research challenges and opportunities: rich metadata and effective data retrieval methods are essential; storage is better provided as a service; and automatic storage systems are emphasized. Motivated by these challenges and opportunities, we propose a distributed data storage and management system based on inter-object relationships. This thesis focuses on its architecture, its fundamental subsystems, and its key techniques, including query and retrieval, distributed storage, and automatic optimization.

First, we present the architecture of the distributed data storage and management system. The semantic model based on inter-object relationships is presented, together with the concepts of object and inter-object relationship. Then the query and retrieval language and query processing are described. After that, the system and subsystem architecture is briefly introduced.

Based on peer-to-peer and object-based storage technology, we designed a container-based distributed object storage. In the container-based storage model, a container, which is the unit of data placement and replication, manages a set of storage objects and is responsible for low-level tasks such as block allocation. The container-based storage model reduces metadata size, simplifies system design, and hence improves storage system scalability. The system maintains runtime metadata by self-organizing peer-to-peer technology, and handles server failure and addition transparently. It achieves reliability by primary/slave container replication with dynamic primary election, and achieves consistency by state-based object access and replica healing.

The distance index is an essential data structure for query processing in data management systems. However, the creation and query performance of existing index methods is far from ideal. We propose two indexes for directed graphs: DIX-C, with constant query time, and DIX-2HC, with smaller index size. We also describe their undirected versions, UDIX-AP and UDIX-2HC. The related-query processing algorithm is then presented, based on these distance indexes and interval encoding. The experimental results show that our indexes outperform previous methods and that the related-query processing algorithm is very efficient.

Existing access correlation discovery methods often rely on support to prune the search space, and hence cannot detect many valuable correlations with low support. Furthermore, they are not scalable enough to work in a distributed environment, and none of them can detect inter-server correlations, which are prevalent in distributed storage systems. Therefore, we propose three access correlation mining algorithms, namely HCM, VCM and PFC-Miner, which employ correlation confidence as the primary interest measure. HCM is more efficient, VCM can run incrementally, and both of them work only on a stand-alone server. In contrast, PFC-Miner is a distributed approximate mining algorithm that is very scalable and able to discover inter-server correlations. Experimental results demonstrate the performance of the proposed algorithms, and also show that the mined access correlations can be used to improve the cache hit ratio.

In a storage system, files usually have a large number of replicas and even more highly similar replicas. While existing methods consider only identical replicas, we propose a keyword extraction method, PAKE, which detects files with high content similarity and subsequently extracts keywords from their names. The experimental results show that PAKE can improve retrieval significantly compared to previous methods.
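
The contrast the abstract draws between support and confidence as interest measures for access correlations can be illustrated with a minimal sketch. This is not HCM, VCM or PFC-Miner, whose details the abstract does not give. It only computes the two standard measures over a made-up access trace. The grouping of accesses into "windows", the file names, and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_measures(windows):
    """Return {(a, b): (support, confidence)} for every ordered pair a -> b.

    support    = fraction of windows that contain both a and b
    confidence = (windows containing a and b) / (windows containing a)
    """
    n = len(windows)
    item_count = Counter()
    pair_count = Counter()
    for window in windows:
        items = sorted(set(window))  # ignore repeated accesses in a window
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    measures = {}
    for (a, b), together in pair_count.items():
        measures[(a, b)] = (together / n, together / item_count[a])
        measures[(b, a)] = (together / n, together / item_count[b])
    return measures


# Hypothetical trace: each inner list is one window of co-accessed files.
# "x" and "y" are rarely used, but always accessed together.
windows = [
    ["a", "b"], ["a", "c"], ["a", "b"], ["a"], ["b", "c"],
    ["a", "b"], ["a", "c"], ["b"], ["a", "b"], ["x", "y"],
]

m = pair_measures(windows)

MIN_SUPPORT = 0.3      # assumed threshold for support-based pruning
MIN_CONFIDENCE = 0.9   # assumed threshold for confidence-based selection

by_support = {p for p, (s, _) in m.items() if s >= MIN_SUPPORT}
by_confidence = {p for p, (_, c) in m.items() if c >= MIN_CONFIDENCE}

print("kept by support:   ", sorted(by_support))
print("kept by confidence:", sorted(by_confidence))
```

In this toy trace the pair ("x", "y") appears in only one window out of ten, so a support threshold discards it, yet every access to "x" is accompanied by "y", so a confidence threshold keeps it. That is the kind of low-support correlation the abstract says support-based pruning cannot detect.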
Keywords/Search Tags: distributed storage, semantic file system, personal information management, object-based storage, peer-to-peer, graph index, data mining, automatic storage system