| With the rapid development of information technology and the spread of the Internet of Things (IoT), resource-constrained devices of many kinds have been widely deployed. Highly optimized and simplified hardware implementations of standard cryptographic algorithms are an important means of meeting the security demands of such devices, and they have become a hot topic in cryptography. As the Chinese national standard for commercial cryptography, the SM4 block cipher occupies an extremely important position in the cryptography industry and is widely used in government, banking, the electric power sector, and other fields. Research on lightweight implementations of SM4 is therefore crucial. To make SM4 more applicable to resource-constrained environments, this thesis presents two techniques, called “split-and-join” and “off-peak and stagger”, that optimize the bit-serial architecture of (generalized) Feistel block ciphers. The “split-and-join” technique focuses on saving area. It splits the operation of SM4’s linear layer into 32 parts according to the output bits of the nonlinear substitution. Each output bit is then XORed into particular bits of the 32-bit state that must be updated in the current round, which completes the join. This allows the linear transformation and the branch XOR operation of the Feistel structure to be performed simultaneously. Compared with the general bit-serial architecture, the “split-and-join” technique greatly reduces the area of SM4’s hardware implementation in two respects. First, it reduces the number of Scan Flip-Flops used to store intermediate values from 64 to 8. Second, it reduces the number of XOR gates required for the linear layer from 194 to 8. In addition, this thesis presents the “off-peak and stagger” technique, which reduces the latency while keeping the area low. The “off-peak and stagger” technique adjusts how the S-box output is obtained in different clock cycles, and
this removes the conflict that arises when the encryption and key schedule procedures need the S-box circuit at the same time, thus providing a basis for reducing the latency while keeping the number of S-box circuits unchanged. Furthermore, when a small amount of extra area is allowed, this technique can stagger the XOR operations to overcome the limitation that the state and the round key can only be cyclically shifted within the 32-bit shift register during the linear transformation. The number of clock cycles for one iteration is thereby reduced from 64 to 32. The implementation architecture requires 16 additional registers for storing intermediate values, of which 15 are Scan Flip-Flops. The number of XOR gates required for the linear layer is 14. That is, the “off-peak and stagger” technique reduces the latency by 992 clock cycles with a modest increase in area. It is worth noting that this is the smallest latency attainable for SM4 based on a bit-serial architecture with a single S-box circuit. Finally, this thesis discusses how to reduce the latency further by increasing the data path width at a small area overhead. Two implementation architectures, with data path widths of 2 and 4 respectively, are proposed, providing a wider range of area-latency trade-offs for SM4. Simulation of the proposed architectures in Synopsys Design Compiler K-2015.06 shows that, under the TSMC 90 nm process library, the area-minimized architecture with the “split-and-join” technique requires only 1771 GE and 2336 clock cycles to complete one 128-bit encryption, with a throughput of up to 36.5 Mbps. With the “off-peak and stagger” technique applied, the area-latency trade-off scheme reduces the number of clock cycles from 2336 to 1344 and reaches a throughput of up to 47.6 Mbps under the same library, at a cost of 1861 GE. When the data path width is increased, the latency decreases exponentially, with at most a 6%
increase in area. This makes SM4 highly competitive for resource-constrained devices, even compared with block ciphers designed specifically to be lightweight. |
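
The idea behind “split-and-join” can be illustrated in software with a minimal sketch. It assumes only the standard SM4 linear transformation L(B) = B ⊕ (B<<<2) ⊕ (B<<<10) ⊕ (B<<<18) ⊕ (B<<<24). The function names (`rotl32`, `linear_whole`, `split_and_join`) are hypothetical and chosen for illustration, and the sketch does not model the shift registers, the Scan Flip-Flops, or the 8-XOR-gate circuit of the proposed architecture. It only shows that XORing each substitution output bit into the state at the five positions it feeds gives the same result as computing L on the whole word and then applying the Feistel branch XOR.

```python
MASK32 = 0xFFFFFFFF
# Rotation amounts of the SM4 linear transformation L (0 stands for B itself).
ROTATIONS = (0, 2, 10, 18, 24)


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def linear_whole(b: int) -> int:
    """Reference: L(B) computed on the whole 32-bit word at once."""
    out = 0
    for r in ROTATIONS:
        out ^= rotl32(b, r)
    return out


def split_and_join(state: int, b: int) -> int:
    """Process the substitution output b one bit at a time.

    Split: bit i of b contributes to positions (i + r) mod 32 of L(b).
    Join: those contributions are XORed directly into the branch state,
    so L and the Feistel branch XOR are carried out together.
    """
    for i in range(32):
        if (b >> i) & 1:
            for r in ROTATIONS:
                state ^= 1 << ((i + r) % 32)
    return state


if __name__ == "__main__":
    x0, b = 0x01234567, 0x89ABCDEF  # arbitrary illustrative values
    assert split_and_join(x0, b) == x0 ^ linear_whole(b)
    print(hex(split_and_join(x0, b)))
```

Because L is linear over GF(2), the equality holds for every state and every substitution output, which is what allows a bit-serial design to fold the linear layer into the state update instead of buffering the full 32-bit substitution output first.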