| The generation of huge amount of data in the era of big data brings challenges to data storage.Data compression techniques help to save storage space and reduce communication overhead by using relatively few bits or other information units to represent the original data.Huffman coding is a classical entropy coding algorithm with high compression efficiency,fast coding speed and easy implementation,so it is widely used in many fields such as text data compression,audio-video compression and data security.However,because the traditional Huffman coding algorithm’s table lookup process processes the code words in a serial manner,resulting in a low coding throughput.This paper investigates the Huffman coding process and implements parallel table lookup using SIMD instruction set to improve the throughput of Huffman coding.To protect the privacy of users in the era of big data,privacy preserving techniques are used to process the data.In the data publishing phase,k-anonymization techniques are commonly used to desensitize the data.However,k-anonymization techniques reduce the risk of privacy leakage while generating additional data and increasing the pressure on data storage.In this paper,we use data compression techniques to reduce the storage cost of k-anonymized data.Based on Huffman coding implemented by SIMD,this paper proposes a compression storage scheme for k-anonymized raw data and anonymized data.The main research work and innovations of this paper can be summarized as follows:1.Huffman encoding algorithm based on AVX-512 instruction set is proposed.Huffman coding,as one of the entropy codes with high compression efficiency,is widely used in the entropy coding stage of various compression algorithms and software.Enhancing the throughput of Huffman encoding can improve the compression rate,however,the table lookup process of Huffman needs to scan the original data and perform code word replacement according to the coding table,and the serial processing process makes the throughput of coding limited.In this paper,the Huffman encoding process is optimized based on the SIMD instruction set,and the Huffman_simd algorithm is proposed.The encoding table of this algorithm does not need to store the code length and thus reduces the storage space,and the spatial parallelism of SIMD instructions is used to realize the parallel table lookup operation and data alignment of Huffman encoding,thus improving the encoding throughput.Simulation results show that the algorithm proposed in this paper has higher throughput compared to Huff0,the fastest Huffman implementation library available.2.The original data compression scheme and the anonymous data compression scheme for the k-anonymity model are proposed.The k-anonymity technique is widely used in the data publishing stage as a common privacy protection technique.In this paper,the compression scheme is proposed for the characteristics of k-anonymity anonymous data and original data respectively,for the anonymous data,the compression is performed by Huffman_simd algorithm according to its probability distribution;for the original data,because it has a strong correlation with the anonymous data,the pre-defined rules can be The difference between the two is calculated,and the difference has certain probability distribution characteristics,which is suitable for compression using Huffman_simd algorithm,so the difference is encoded to reduce the use of storage space,Simulation results show that the two compression schemes proposed in this paper can obtain a lower compression rate compared with the lzma compression tool,the zip compression tool for Windows 11,and the bzip2 compression tool. |