| Convolutional neural networks (CNNs) have rapidly developed into effective tools for a wide range of computer vision tasks. As these applications demand higher accuracy, higher efficiency, and lower resource consumption, improving these characteristics has become an important focus of the CNN research community. To this end, this thesis exploits the high parallelism available in Field-Programmable Gate Arrays (FPGAs) to reduce the latency of CNN inference. A pipelined CNN FPGA structure with flexible parallelism is used to trade off compute time against hardware utilization, and neural architecture search (NAS) is used to improve accuracy by optimizing the CNN architectures. The main contributions of this thesis are as follows:
(1) A Pipelined CNN FPGA Structure with Flexible Parallelism: This thesis explores a pipelined CNN FPGA accelerator structure as a way to reduce CNN latency. Different pipelined hardware structures are implemented depending on the complexity of the target CNN. (a) First, a pipelined structure for a network with a convolutional layer and a pooling layer is implemented. This implementation processes the pooling layer concurrently with the convolutional layer, and the hardware supports different CNN parameters. It reduces latency by 37.5% compared with designs that do not use pipelining. An ASIC is also designed for this implementation. (b) For CNNs of low complexity, a CNN hardware accelerator structure with fully pipelined layers and fully parallel channels is implemented. This structure achieves the lowest latency compared with state-of-the-art implementations. Timing optimization based on the timing of the High-Definition Multimedia Interface (HDMI) and quantization methods are used in this design to further reduce latency and hardware utilization. The Open Coherent Accelerator Processor Interface (OpenCAPI) is used to ensure high-bandwidth communication with the host processor. Multiple CNN instances are implemented in the system to increase throughput. The LeNet-5 implemented in this thesis achieves a latency of 9.32 μs and a throughput of 1.11 TOPS under a 250 MHz clock, with an accuracy of 98.8% on the MNIST dataset. Compared with state-of-the-art LeNet-5 implementations, the latency is reduced by 18.1% and the throughput is increased by 2.5 times. (c) For CNNs of higher complexity, a fully pipelined CNN hardware accelerator structure with semi-parallel channels is implemented. In this structure, the degree of parallelism can be chosen according to the hardware resources available on the target FPGA, and a NAS strategy is used to increase the accuracy of the CNNs. The binary ResNet18 with NAS implemented in this design achieves 60.5% Top-1 accuracy on ImageNet. Its latency varies from 1.12 to 6.33 ms, while its throughput varies from 4.56 to 0.71 TOPS under a 200 MHz clock. Compared with state-of-the-art ResNet18 implementations, the latency is reduced by 8 times and the throughput is increased by 1.9 times.
(2) Applying NAS to the Automated CNN Hardware Design Process: This thesis explores automating the design flow of CNN FPGA implementations using FINN, a CNN FPGA accelerator compiler. First, LeNet-5 models are implemented with the FINN compiler. In addition, this thesis integrates the NAS strategy with FINN to enable automated implementation. With this strategy, the implementation achieves not only higher accuracy but also a better trade-off between accuracy and hardware utilization. Compared with the baseline ResNet18 models, this strategy achieves up to a 3.0% increase in Top-1 accuracy and up to a 2.2% increase in Top-5 accuracy. For ResNet34 models, the increases are up to 3.1% and 2.1% for Top-1 and Top-5 accuracy, respectively. The best throughput of this implementation is 324.5 fps. This automated design flow was also applied in several practical application domains, where it achieved high accuracy: 99.3% for satellite image detection, 95.3% for Covid-19 X-ray image detection, 94.7% for plant disease detection, and 96.1% for flower type detection.
(3) Analyzing the Advantages and Disadvantages of Manual and Automated Design Processes: This thesis compares the CNN implementations produced with the manual design process against those produced with the automated design process in the earlier parts of the thesis. The comparison covers design time, design flexibility, latency, throughput, accuracy, and hardware utilization, and the benefits and drawbacks of each process are analyzed. This analysis can serve as a reference for design choices in different situations in the future.
In conclusion, this thesis explores a pipelined CNN FPGA structure with flexible parallelism and integrates the NAS strategy with automated implementation, under the design goals of low latency, high accuracy, and low hardware utilization. It also analyzes the pros and cons of the different design processes and can serve as a reference for future CNN FPGA designs. |
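The trade-off between compute time and hardware utilization described under contribution (1) can be illustrated with a minimal Python sketch. It is not the accelerator design from the thesis: it uses the textbook timing model for a streaming pipeline, and all function names and workload numbers below are hypothetical. The only figure taken from the abstract is the 250 MHz clock, at which the reported 9.32 μs latency corresponds to 2330 clock cycles.

```python
from math import ceil


def stage_cycles(work_per_item: int, parallelism: int) -> int:
    """Cycles one stage needs per item when `parallelism` units share the work."""
    return ceil(work_per_item / parallelism)


def sequential_cycles(n_items: int, per_item: list[int]) -> int:
    """No pipelining: every item passes through all stages before the next starts."""
    return n_items * sum(per_item)


def pipelined_cycles(n_items: int, per_item: list[int]) -> int:
    """Textbook pipeline model: the first item fills the pipeline, then one
    item completes every `max(per_item)` cycles (the slowest stage dominates)."""
    return sum(per_item) + (n_items - 1) * max(per_item)


def cycles_to_us(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds for a given clock frequency."""
    return cycles / clock_mhz


def sweep(n_items: int, work: list[int], parallelisms: list[int]):
    """Return (parallelism, sequential cycles, pipelined cycles) per setting."""
    rows = []
    for p in parallelisms:
        per_item = [stage_cycles(w, p) for w in work]
        rows.append((p,
                     sequential_cycles(n_items, per_item),
                     pipelined_cycles(n_items, per_item)))
    return rows


if __name__ == "__main__":
    # Hypothetical three-stage workload (e.g. convolution, pooling, activation),
    # expressed as operations per output item. Not taken from the thesis.
    work = [72, 36, 4]
    for p, seq, pipe in sweep(n_items=64, work=work, parallelisms=[1, 4, 36]):
        print(f"parallelism={p:2d}  sequential={seq:5d} cycles  "
              f"pipelined={pipe:5d} cycles  "
              f"({cycles_to_us(pipe, 250.0):.2f} us at 250 MHz)")
```

In this simplified model, raising the parallelism factor shortens every stage at the cost of more compute units, while pipelining hides the shorter stages behind the slowest one. This is the kind of latency versus resource trade-off that a flexible-parallelism design exposes.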