| Convolutional neural networks (CNNs) have been widely applied in various applications, such as character recognition, image classification, and natural language analysis. Because of the specific computation pattern of CNNs, general-purpose processors are inefficient and can hardly achieve high performance. In practice, graphics processing units (GPUs) are widely used to accelerate the training and classification tasks of CNNs. However, they suffer from energy inefficiency. As an alternative to GPUs, various CNN accelerators based on field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) have been proposed recently. Among these platforms, FPGA-based accelerators have become increasingly popular by virtue of their high reconfigurability, fast turnaround time, and better energy efficiency. However, many challenges remain. As is well known, FPGA platforms are constrained by limited hardware computing resources and expensive off-chip memory access, yet state-of-the-art CNN models involve a large number of computation operations (>1G) and a large number of weights (>50M), which consume a large amount of energy. As technology advances, the increasing scale and complexity of CNN models, adopted to achieve higher accuracy, further aggravate this situation by requiring more energy for computation and memory access. Therefore, an energy-efficient accelerator is needed. A CNN model has multiple convolutional layers, but existing implementations use the same parallelism strategy for all of them; such a "one size fits all" approach may result in low resource utilization. To overcome this problem, we propose PiPe. Building on PiPe, we found that both the convolutional layers and the fully connected layers can be converted into matrix multiplication. In this way, a uniform processing engine can be designed to compute the whole CNN model. Based on this finding, we propose UniCNN. Specifically, we 
make the following contributions: PiPe. In this work, we propose a pipelined energy-efficient accelerator for CNNs. The accelerator consists of multiple processing elements (PEs), each of which is responsible for the computation of one layer in the network model. All the PEs are mapped onto one chip so that different layers can work concurrently in a pipelined style. A methodology is proposed to balance the pipeline stages. For the memory-intensive fully connected (FC) layers, a pruning method and the compressed sparse column (CSC) format are used to decrease the number of weights, which saves considerable storage and computation. Moreover, a batch-based computing method is applied to the compressed data in order to reduce the required memory bandwidth. As a case study, we implement a large-scale CNN model, AlexNet, on two FPGA platforms, Zedboard and Virtex-7 boards, which have different hardware resources. The experimental results show that our proposed accelerators achieve higher energy efficiency than previous accelerators. UniCNN. In this work, we propose a pipelined accelerator towards uniform computing for CNNs. The accelerator converts the computation of a convolutional layer to matrix multiplication by rearranging the input feature map on the fly; it also converts the computation of an FC layer to matrix multiplication by using the batch-based method. Finally, a pipelined computation method is proposed to optimize the entire process. The experimental results show that our proposed accelerator achieves higher computing resource utilization than the state of the art. Based on PiPe and UniCNN, we also provide programming models for the two accelerators, which makes it convenient for users to accelerate their applications with our proposed accelerators. In summary, this dissertation provides a high-performance, low-power, easy-to-use CNN solution based on a single FPGA chip. |
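
The conversion of convolutional and FC layers into matrix multiplication mentioned in the abstract can be illustrated in software. The sketch below is not the accelerator's hardware design; it is a minimal pure-Python illustration that assumes no padding, a square kernel, and a channel-major data layout. The function names (`im2col`, `matmul`, `conv_as_matmul`, `fc_batched`) are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: all names here are hypothetical and are not
# taken from the dissertation. Assumes no padding and square kernels.

def im2col(fmap, k, stride=1):
    """Rearrange a C x H x W feature map into a (C*k*k) x (out_h*out_w) matrix."""
    height, width = len(fmap[0]), len(fmap[0][0])
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    rows = []
    for channel in fmap:
        for ky in range(k):
            for kx in range(k):
                # One row per (channel, ky, kx); one column per output pixel.
                rows.append([channel[oy * stride + ky][ox * stride + kx]
                             for oy in range(out_h) for ox in range(out_w)])
    return rows, out_h, out_w


def matmul(a, b):
    """Plain matrix product of a (m x n) and b (n x p), as nested lists."""
    inner, n_cols = len(b), len(b[0])
    return [[sum(a_row[i] * b[i][j] for i in range(inner))
             for j in range(n_cols)] for a_row in a]


def conv_as_matmul(fmap, kernels, stride=1):
    """Convolution of a C x H x W map with M kernels (each C x k x k)."""
    k = len(kernels[0][0])
    cols, out_h, out_w = im2col(fmap, k, stride)
    # Flatten each kernel in the same (channel, ky, kx) order used by im2col.
    weights = [[w for ch in kern for krow in ch for w in krow]
               for kern in kernels]
    flat = matmul(weights, cols)
    # Reshape each output row back into an out_h x out_w map.
    return [[row[oy * out_w:(oy + 1) * out_w] for oy in range(out_h)]
            for row in flat]


def fc_batched(weights, batch):
    """FC layer on a batch: (out x in) weights times an (in x batch) matrix."""
    inputs = [list(col) for col in zip(*batch)]  # one column per sample
    return matmul(weights, inputs)


fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3 x 3
kernels = [[[[1, 0], [0, 1]]]]               # 1 kernel, 1 channel, 2 x 2
print(conv_as_matmul(fmap, kernels))         # [[[6, 8], [12, 14]]]
```

With a batch of inputs, the FC layer multiplies one weight matrix by a matrix of samples rather than by a single vector, so each weight is loaded once and reused across the batch. That reuse is the intuition behind the bandwidth reduction described above.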