
Research On Fundamental Theory And Optimization Of Differential Compression

Posted on: 2022-03-27  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z J Hu  Full Text: PDF
GTID: 1488306602493654  Subject: Military communications science
Abstract/Summary:
With the development of computer technology and Internet of Things (IoT) technology, the volume of global data is growing at an alarming rate, and its variety keeps expanding. These data touch every aspect of human life, including logistics, retail, healthcare, airport security, transportation, and environmental monitoring. By mining and analyzing these data, people can uncover hidden insights, reveal the mysteries of nature, and create enormous value for the world. Research on massive data has therefore become a focus of academia and industry, and has even been elevated to the level of national development strategy, as in the "e-Europe", "Smart Earth", and "Perceive China" initiatives.

However, before massive data can be processed and generate value, it must be stored, temporarily or permanently. Data storage is thus the foundational stage that holds the raw data, intermediate results, and final results. Unfortunately, storage capacity has not grown as fast as data production, so the vast majority of massive data may have to be discarded. Although the price of storage devices keeps falling, it is very difficult for capacity to keep pace with exponentially growing data. Advances and innovations in storage theory and technology are therefore the key to averting the loss of massive data and information.

Massive data usually comes from arrays of heterogeneous sensing devices and forms a real-time picture of the state of the real world. Repeated sampling by these devices produces a large amount of redundant data, which further inflates the demand for storage space and network traffic. Differential compression, a lossless compression technique, can compress files that share common segments more intelligently and efficiently. Specifically, it finds the matching segments between two files and encodes one file as a delta sequence relative to the other, saving both redundant storage space and data transmission time. This thesis studies the fundamental theory and optimization of differential compression. The main research work and contributions are summarized as follows.

1. Existing research on differential compression focuses mainly on algorithm development and technical application, while its fundamental theory remains relatively underdeveloped. Starting from the concept of differential compression, this thesis first defines the delta encoding sequence. With respect to construction methods, the reference-editing construction model and the empty-editing construction model are proposed. Differential compression is then classified by reconstruction space, spatio-temporal relationship, and object type, from which twelve researchable types are derived. Finally, a definition of differential compression edited with COPY and ADD operations is given, together with two expressions of the delta encoding sequence. This fundamental theory provides guidance for the subsequent work.

2. To address the poor detection quality caused by omitting shared fragments during commonality detection between files, the shared fragment set is first defined, and the positional relationships between shared fragments are classified as separated, included, or overlapped. The total length of all separated shared fragments is then proposed as the measure of commonality between files, and the shared fragment set (SFS) algorithm is designed to find cascading sequences and separate them, yielding a better shared fragment set. Theoretical analysis and experimental simulation show that the shared fragment set generated by the SFS algorithm contains all separated shared fragments, and its commonality measurements are about 10% and 4% higher than those of the greedy string tiling (GST) algorithm and the Greedy algorithm, respectively, providing a prerequisite for implementing differential compression.

3. Current differential compression algorithms either focus only on longer matching fragments and ignore shorter ones, or pursue low time and space complexity at the cost of discarding matching fragments outright; both choices hurt their ability to save storage space. Taking into account how COPY and ADD operations are stored in a computer system, this thesis first defines a new cost model of differential compression, in which the total length and the number of matching fragments are used to evaluate the delta encoding sequence. Then, to maximize the total length and minimize the number of matched fragments, the maximal total length of copied fragments (MTLC) algorithm is constructed: it identifies the longest matching segment at each offset, splits overlapping parts reasonably, and edits with COPY and ADD operations to obtain a delta encoding sequence. Theoretical analysis and experimental results show that the delta encoding sequence generated by the MTLC algorithm maximizes the total length of matched fragments with the fewest COPY operations, outperforming the sequences generated by the Greedy and Hsadelta algorithms.

4. To remedy the insufficient efficiency and compactness of the delta encoding sequence, the previously defined cost model is reused. First, the cumulative cost of all COPY and ADD operations is proposed as the evaluation criterion for the delta encoding sequence. Second, subsequences are divided into four types according to the number and positions of the ADD operations they contain, and the corresponding merging rules are derived. Then, following the idea of divide and conquer, the minimum delta encoding cost (MDC) algorithm is designed; through reasonable decomposition and merging of the delta encoding sequence, it completes the correction of the sequence. Theoretical analysis and experimental results show that the delta encoding sequence corrected by the MDC algorithm minimizes the cumulative cost and contains fewer operations, which is more advantageous for saving storage space. Moreover, this correction method is universal: it can be appended after any COPY/ADD differential compression algorithm to achieve secondary compression and further improve the efficiency with which a computer system stores data.
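To make the COPY/ADD editing model concrete, the following is a minimal sketch of delta decoding: replaying a delta sequence against a reference file to reconstruct the target. The tuple format ("COPY", offset, length) / ("ADD", literal) is an illustrative assumption for this sketch, not the encoding defined in the thesis.

```python
def apply_delta(reference: bytes, delta: list) -> bytes:
    """Reconstruct the target by replaying COPY/ADD operations."""
    out = bytearray()
    for op in delta:
        if op[0] == "COPY":
            _, offset, length = op
            out += reference[offset:offset + length]  # reuse a shared fragment
        elif op[0] == "ADD":
            out += op[1]  # literal bytes not found in the reference
        else:
            raise ValueError(f"unknown operation {op[0]!r}")
    return bytes(out)

ref = b"the quick brown fox"
delta = [("COPY", 0, 10), ("ADD", b"red"), ("COPY", 15, 4)]
print(apply_delta(ref, delta))  # b'the quick red fox'
```

The delta stores only 3 bytes of literal data plus two fragment references, which is the storage saving differential compression aims at.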
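The encoding side can be illustrated by a deliberately simplified greedy encoder: at each target offset it brute-forces the longest match in the reference and emits a COPY when the match is long enough, otherwise buffers literals into an ADD. This is a hedged sketch of the general COPY/ADD approach only; it is neither the SFS nor the MTLC algorithm, and the `min_match` threshold is an assumed parameter.

```python
def encode_delta(reference: bytes, target: bytes, min_match: int = 4) -> list:
    """Greedy COPY/ADD encoding of `target` against `reference` (illustrative)."""
    delta, literals, i = [], bytearray(), 0

    def flush():
        nonlocal literals
        if literals:                      # close the pending ADD, if any
            delta.append(("ADD", bytes(literals)))
            literals = bytearray()

    while i < len(target):
        best_off, best_len = 0, 0
        for off in range(len(reference)):  # brute-force longest-match search
            l = 0
            while (off + l < len(reference) and i + l < len(target)
                   and reference[off + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= min_match:
            flush()
            delta.append(("COPY", best_off, best_len))
            i += best_len
        else:
            literals.append(target[i])     # too short to be worth a COPY
            i += 1
    flush()
    return delta
```

The O(n·m) search is only for readability; practical encoders index the reference (e.g. by fingerprinting fixed-length seeds) to find matches quickly.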
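The secondary-compression idea behind contribution 4 can be sketched as a cost-driven post-processing pass over an existing delta sequence: under an assumed cost model (the header costs below are illustrative, not the thesis's actual model), a COPY shorter than its own encoding overhead is cheaper to fold into literals, and adjacent ADD operations are then merged to save per-operation headers. This is a hedged simplification, not the MDC algorithm itself.

```python
COPY_COST = 9    # assumed fixed encoding cost of one COPY operation, in bytes
ADD_HEADER = 1   # assumed per-operation overhead of one ADD, in bytes

def compact(delta: list, reference: bytes) -> list:
    """Fold uneconomical COPYs into literals and merge adjacent ADDs."""
    ops = []
    for op in delta:
        # A COPY that saves fewer bytes than its header costs is folded
        # into the literal bytes it would have copied from the reference.
        if op[0] == "COPY" and COPY_COST > op[2]:
            op = ("ADD", reference[op[1]:op[1] + op[2]])
        # Merging two neighbouring ADDs saves one ADD_HEADER.
        if op[0] == "ADD" and ops and ops[-1][0] == "ADD":
            ops[-1] = ("ADD", ops[-1][1] + op[1])
        else:
            ops.append(op)
    return ops
```

Because the pass only consumes and produces COPY/ADD sequences, it can be chained after any COPY/ADD encoder, which mirrors the universality claim made for the MDC correction method.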
Keywords/Search Tags:Data storage, Differential compression, Fundamental theory, File commonality detection, Delta encoding sequence, Divide and conquer