
Research On Technologies For High-effect Data De-duplication

Posted on: 2015-01-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G H Wang
Full Text: PDF
GTID: 1268330422981431
Subject: Computer application technology
Abstract/Summary:
Human society has entered the era of information technology, and with the explosion of digital data, storage systems accumulate ever more redundant information. This redundancy not only occupies additional storage space, but also degrades storage-system performance and raises the cost of data management. Research on data-reduction techniques that eliminate duplicate data is therefore essential for optimizing and improving the performance of storage systems.

Data de-duplication, one such data-reduction technique, increases storage-space utilization and reduces data-management cost by deleting large amounts of redundant data. It has become a hot research topic in the field of computer storage.

Currently, the main technical challenge of data de-duplication is how to improve storage-system performance by enhancing de-duplication efficiency. This efficiency, a critical factor in improving storage-space utilization and optimizing system performance, depends mainly on three factors: the de-duplication strategy, the de-duplication ratio, and the duplicate-detection speed. This dissertation investigates approaches to enhancing de-duplication efficiency, focusing in particular on de-duplication architecture, a global de-duplication strategy, a memory index method, and a pipeline-based duplicate-detection method. The main research contributions are as follows:

(1) To remedy the poor scalability of traditional de-duplication architectures, a clustered two-level data de-duplication architecture (CTDDA) is proposed. CTDDA is composed of a client, a metadata server, and multiple storage nodes; new nodes can be added whenever needed to expand system capacity. CTDDA supports both file-level and chunk-level data de-duplication.
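A minimal sketch of such a two-level scheme (the function name, the hash choice, and fixed-size chunking are illustrative assumptions, not details from the dissertation):

```python
import hashlib

CHUNK_SIZE = 4096    # fixed-size chunking for simplicity; real systems often chunk variably

file_index = set()   # whole-file fingerprints (file level, at the metadata server)
chunk_index = set()  # chunk fingerprints (chunk level, at the storage nodes)

def dedup_file(data: bytes) -> list[bytes]:
    """Return only the chunks of `data` that are not already stored."""
    file_fp = hashlib.sha1(data).digest()
    if file_fp in file_index:        # level 1: the whole file is a duplicate
        return []
    file_index.add(file_fp)
    new_chunks = []
    for i in range(0, len(data), CHUNK_SIZE):  # level 2: chunk-level de-duplication
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).digest()
        if fp not in chunk_index:
            chunk_index.add(fp)
            new_chunks.append(chunk)
    return new_chunks
```

The point of the two levels is that a duplicate file is rejected with a single fingerprint comparison, and only non-duplicate files pay the cost of chunking.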
It first eliminates duplicate files at the metadata server, distributes the non-duplicate files evenly to the storage nodes, and then de-duplicates at the chunk level in parallel. This two-level architecture, with parallel operation across all nodes, substantially enhances de-duplication efficiency.

(2) To eliminate redundant data among storage nodes, a global data de-duplication strategy based on Bloom filters (GDDSBF) is proposed. To prevent each node in CTDDA from eliminating duplicate data only locally, GDDSBF creates a fingerprint summary vector for each node using a Bloom filter, and gathers all the vectors into a global fingerprint summary array (FSA). Each node can then detect duplicate data globally by searching the FSA, achieving a high de-duplication ratio. Furthermore, when a new node is added to the storage cluster, GDDSBF extends the detection range to cover all nodes, including the new one, by inserting the new node's fingerprint summary vector into the FSA. Theoretical analysis and experimental results show that GDDSBF deletes more redundant data and attains a higher de-duplication ratio than a local de-duplication strategy, thereby improving the space utilization of storage systems.

(3) To accelerate duplicate-data detection, a memory index based on a hash table (MIMHT) is presented. In data de-duplication, a data index is generally used to detect duplicates. As data grows, the index becomes very large and may exceed the capacity of memory, so it must be stored on disk. To alleviate the disk I/O bottleneck during duplicate detection, MIMHT reads the hot part of the index from disk into memory and builds a memory index on a hash table, with the index entries belonging to the same container connected by a circular linked list.
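A sketch of such a container-grouped index (a set stands in for the circular linked list; the class and method names are hypothetical, not from the dissertation):

```python
from collections import defaultdict

class ContainerIndex:
    """Memory index keyed by chunk fingerprint; entries are grouped by the
    container they came from so they can be read and evicted as a unit."""

    def __init__(self):
        self.table = {}                       # fingerprint -> container id (the hash table)
        self.by_container = defaultdict(set)  # container id -> its fingerprints

    def load_container(self, cid, fingerprints):
        # Reading is in units of containers: all of a container's entries enter together.
        for fp in fingerprints:
            self.table[fp] = cid
            self.by_container[cid].add(fp)

    def evict_container(self, cid):
        # Replacement is also in units of containers.
        for fp in self.by_container.pop(cid, ()):
            del self.table[fp]

    def lookup(self, fp):
        return self.table.get(fp)
```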
The reading and replacement of index entries in MIMHT is thus performed in units of containers, which yields a higher hit rate and reduces the frequency of disk index accesses. Experimental analysis shows that MIMHT achieves a higher hit rate and detection speed than DDFS (Data Domain File System) and a grouping-prediction method based on an undirected graph, improving the I/O performance of storage systems.

(4) Combining the FSA and the memory index, and dividing the duplicate-detection process into multiple stages, a pipeline-based duplicate data detection method (DDDMP) is proposed. DDDMP further accelerates duplicate detection inside each node through pipelining. Double buffer queues synchronize the threads of adjacent pipeline stages, and the memory-index querying stage, which may cause pipeline stalls, is optimized with multiple threads. Experimental results show that DDDMP is significantly superior to the sequential method: it further accelerates duplicate detection and improves de-duplication efficiency as well as the performance of the overall system.
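The pipelining idea can be sketched with two stages coupled by a bounded queue (the stage split, queue size, and names here are illustrative assumptions, not the dissertation's actual design):

```python
import hashlib
import queue
import threading

def pipeline_dedup(chunks):
    """Two-stage pipeline: fingerprinting and index lookup run concurrently,
    synchronized by a bounded buffer queue between the adjacent stages."""
    buf = queue.Queue(maxsize=64)  # buffer between adjacent pipeline stages
    index = set()
    unique = []

    def fingerprint_stage():
        for chunk in chunks:
            buf.put((hashlib.sha1(chunk).digest(), chunk))
        buf.put(None)              # sentinel: end of stream

    def lookup_stage():
        while (item := buf.get()) is not None:
            fp, chunk = item
            if fp not in index:    # duplicate detection against the in-memory index
                index.add(fp)
                unique.append(chunk)

    t1 = threading.Thread(target=fingerprint_stage)
    t2 = threading.Thread(target=lookup_stage)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return unique
```

With more stages (chunking, fingerprinting, FSA query, memory-index query, storage), each stage overlaps with the others in the same way, which is where the speedup over sequential detection comes from.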
Keywords/Search Tags:Data de-duplication, Fingerprint summary array, Memory index, Pipeline