
Research And Implementation Of Duplicate Data Management Technology Based On FastDFS

Posted on: 2015-08-19    Degree: Master    Type: Thesis
Country: China    Candidate: J Zhang    Full Text: PDF
GTID: 2308330473451791    Subject: Information security
Abstract/Summary:
With the rapid development of computer technology, digital information is growing explosively; in cloud storage systems in particular, data volumes now reach the petabyte scale. Faced with such enormous amounts of data, finding and eliminating duplicate data in the system efficiently has become a critical research problem.

Data chunking algorithms can quickly and efficiently detect duplicate data across files and are the core technology of identical-data detection. Existing chunking algorithms suffer from uncertain chunk boundaries, which can make blocks too large and produce data fragmentation. To reduce hard (fixed-boundary) blocks in the system while balancing the conflicting goals of raising the deduplication rate and lowering the time cost of chunking, this thesis proposes SWCDC, a sliding-window chunking method based on pre-chunking. SWCDC uses a larger expected block size for regions of a file whose content has not changed, and a smaller expected block size for the remaining regions. By distinguishing data in changed regions from data in unchanged regions, SWCDC is especially well suited to deduplication management systems that contain a large amount of duplicate data. In addition, to reduce the per-block metadata overhead, this thesis builds on SWCDC and proposes ISWFDC, a sliding-window chunking method based on block merging. Experimental results show that SWCDC and ISWFDC achieve higher deduplication performance than conventional chunking algorithms.

To address the problems that existing Bloom filters are too slow when checking large block-fingerprint sets and cannot adapt well to the dynamic growth of the fingerprint set in a cloud storage environment, this thesis proposes DBFMS, a dynamic Bloom filter matrix set.
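The content-defined chunking idea underlying sliding-window methods such as SWCDC can be illustrated with a minimal sketch. This is a generic baseline, not the thesis's SWCDC or ISWFDC: the byte-sum rolling hash, window size, boundary mask, and block-size limits below are all illustrative placeholders (production deduplication systems typically use a Rabin fingerprint for the window hash).

```python
# Minimal content-defined chunking sketch. A window of the last WINDOW
# bytes is hashed with a rolling byte sum; a chunk boundary is declared
# wherever the hash matches the mask. All parameters are illustrative.
WINDOW = 48           # sliding-window width in bytes
MASK = 0x1FFF         # boundary when (hash & MASK) == 0, ~8 KiB expected size
MIN_BLOCK = 2048      # suppress boundaries that would create tiny blocks
MAX_BLOCK = 65536     # hard upper bound caps block size and fragmentation

def chunk(data: bytes):
    """Return a list of (start, end) byte ranges covering `data`."""
    blocks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h += b
        if i - start >= WINDOW:
            h -= data[i - WINDOW]      # slide the window: drop the oldest byte
        length = i - start + 1
        if (length >= MIN_BLOCK and (h & MASK) == 0) or length >= MAX_BLOCK:
            blocks.append((start, i + 1))
            start, h = i + 1, 0
    if start < len(data):
        blocks.append((start, len(data)))
    return blocks
```

Because boundaries are chosen by content rather than by fixed offsets, an insertion in one region of a file shifts boundaries only locally; SWCDC's refinement of applying a larger expected block size to unchanged regions would correspond here to switching MASK per region.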
DBFMS represents the data-block fingerprint set as a set of bit matrices rather than as individual Bloom filter bit strings, which significantly improves the efficiency of retrieving duplicate block fingerprints. Theoretical analysis, simulations, and experiments show that, compared with traditional static and dynamic Bloom filters, DBFMS improves scalability, query efficiency, and the false-positive probability.

Finally, combining deduplication management theory, the system architecture model, and the improved algorithms, the thesis implements a duplicate-data management platform on top of the open-source distributed file system FastDFS, deployed as a configured FastDFS cluster. The system provides file upload, download, deletion, renaming, and duplicate-data management. Comparative experiments between the system running the improved algorithms and the original system show that the former performs better and is more efficient, and is therefore better suited to cloud storage environments.
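As a point of reference for the DBFMS comparison above, a conventional in-memory Bloom filter over block fingerprints can be sketched as follows. This is the standard baseline the thesis improves on, not DBFMS itself; the bit-array size, the number of hash functions, and the SHA-1-based double hashing are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Standard Bloom filter for block-fingerprint membership checks
    (baseline structure, not the thesis's DBFMS matrix variant)."""

    def __init__(self, m_bits: int = 1 << 20, k: int = 4):
        self.m = m_bits                    # size of the bit array
        self.k = k                         # number of hash functions
        self.bits = bytearray(m_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from one digest via double hashing:
        # position_i = (h1 + i * h2) mod m
        d = hashlib.sha1(fingerprint).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, fingerprint: bytes):
        for p in self._positions(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # False is definitive (block is new); True may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fingerprint))
```

A `False` answer means the block is definitely new and must be stored; a `True` answer may be a false positive, so the system must confirm against the authoritative fingerprint index. The cost of these lookups as the fingerprint set grows is exactly what DBFMS's matrix organization targets.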
Keywords/Search Tags: duplicate data management, identical-data detection, Bloom filter, data chunking algorithm