| Deep neural networks are developing rapidly and are increasingly applied in network security. Because they are over-parameterised, neural networks have high storage and computational requirements and are slow to run. This may create security risks in scenarios that require real-time processing. Neural network pruning can, in theory, reduce model redundancy, decrease computational complexity, and thus reduce inference latency. Among existing pruning methods, structured pruning achieves more efficient inference, while unstructured pruning better preserves accuracy. To combine the advantages of both, researchers have proposed further pruning patterns that lie between the two. The current mainstream pruning patterns have the following drawbacks: 1) Once a model has been pruned into a sparse model, it must use a sparse computing library for its calculations. However, current sparse computing libraries are not fast enough to accelerate inference, and sparse inference cannot even match the speed of the unpruned model. 2) There are multiple implementations of the same pruning pattern, but they all require specific hardware or rewritten computing libraries, which makes them hard to port across hardware platforms. Therefore, although the new pruning patterns are proposed to obtain more efficient sparse neural network models with high accuracy, the practical results are poor. This thesis focuses on hardware-friendly, high-accuracy pruning methods and carries out a series of studies to address these issues. The research covers the following two aspects: (1) Binary mask-based inference latency optimization for the group-wise pruning pattern. Current pruning work focuses mainly on reducing computation and uses the number of floating-point operations (FLOPs) as the measure of pruning acceleration. However, when inference speed is the real optimization target, neural network inference latency should be the main metric of interest, rather than
just focusing on reducing computational complexity in terms of floating-point operations (FLOPs). Therefore, this work primarily measures the performance of different pruning patterns in terms of inference time. It selects several mainstream pruning patterns and conducts model inference experiments on different neural network models and datasets. The experiments use an off-the-shelf sparse computing library to measure inference latency. The results show that inference with the pruned sparse models is even slower than with the dense models. Motivated by these experiments, this thesis implements an efficient inference method for group-wise pruning. The method reassembles the pruned sparse model, converts the way it is computed, and optimizes data storage and access according to hardware characteristics. Following the model execution process, it adds an initialization step to the inference stage, which removes redundant steps that would otherwise be repeated during inference. With these operations, the sparse model is better adapted to current hardware, which effectively shortens inference latency while keeping the method highly portable. The model can be deployed on any hardware platform that can perform dense computation, without manually modifying the computing library. (2) Layer-sparse adaptive group-wise pruning algorithm based on binary mask. Although the group-wise pruning pattern can be combined with any pruning algorithm to train a sparse model, the result is not always highly accurate. Therefore, to make better use of the group-wise pruning pattern, this thesis proposes a pruning algorithm designed for it, namely the layer-sparse adaptive group-wise pruning algorithm based on binary mask. The algorithm uses a binary mask to mark the pruned parts and automatically determines the sparsity of each network layer during training. This approach carefully distinguishes the sensitivity to pruning of different
layers in the model, in order to better preserve the model's accuracy. Whether a weight is redundant is decided not only from its current value but also from how it may change later in training, which avoids removing weights that become important in subsequent training and would otherwise cause a significant loss of accuracy. |
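
The idea of marking pruned groups with a binary mask and then reassembling the sparse model so that it runs as an ordinary dense computation can be illustrated with a small sketch. This is not the thesis's implementation. It assumes, purely for illustration, that a group is a contiguous block of input columns of one weight matrix, that groups are ranked by their L1 norm, and that the sparsity is fixed in advance; the function names `group_mask`, `compact`, `masked_matvec`, and `compact_matvec` are hypothetical.

```python
# Illustrative sketch only: the group layout, the L1 scoring rule,
# and every name below are assumptions, not the method from the thesis.

def group_mask(weights, group_size, sparsity):
    """Return one 0/1 flag per column group, keeping the groups with the largest L1 norm."""
    n_cols = len(weights[0])
    assert n_cols % group_size == 0, "columns must divide evenly into groups"
    n_groups = n_cols // group_size
    scores = []
    for g in range(n_groups):
        cols = range(g * group_size, (g + 1) * group_size)
        scores.append(sum(abs(row[c]) for row in weights for c in cols))
    n_keep = n_groups - int(sparsity * n_groups)
    keep = sorted(range(n_groups), key=lambda g: scores[g], reverse=True)[:n_keep]
    mask = [0] * n_groups
    for g in keep:
        mask[g] = 1
    return mask


def compact(weights, mask, group_size):
    """One-time initialization: drop pruned column groups and record which columns survive."""
    kept_cols = [g * group_size + i
                 for g, m in enumerate(mask) if m
                 for i in range(group_size)]
    dense = [[row[c] for c in kept_cols] for row in weights]
    return dense, kept_cols


def masked_matvec(weights, mask, group_size, x):
    """Reference result: full-size multiply with the mask applied (no work is saved)."""
    return [sum(w * x[c] * mask[c // group_size] for c, w in enumerate(row))
            for row in weights]


def compact_matvec(dense, kept_cols, x):
    """Dense multiply on the compacted weights; only the kept inputs are read."""
    xs = [x[c] for c in kept_cols]
    return [sum(w * v for w, v in zip(row, xs)) for row in dense]
```

In this sketch `compact` plays the role of an initialization step: it runs once after pruning, so each later call to `compact_matvec` is a plain dense product over the surviving columns and needs no sparse computing library. Whether this yields a real speedup depends on the hardware and the dense kernels used, which the sketch does not measure.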