Design Methods For Optimized CNN Hardware Using Parallelization, Pipelining And Architecture Search

Posted on: 2024-10-12

Degree: Doctor

Type: Dissertation

Country: China

Candidate: M F Ji


GTID:1528307340476154

Subject: Microelectronics and Solid State Electronics

Abstract/Summary:


Convolutional neural networks (CNNs) have developed rapidly as effective tools for a wide range of computer vision tasks. As these applications demand higher accuracy, higher efficiency, and lower resource consumption, improving these characteristics has become an important focus of the CNN research community. To that end, this thesis exploits the high parallelism available in field-programmable gate arrays (FPGAs) to reduce the latency of CNN inference. A pipelined CNN FPGA structure with flexible parallelism is used to trade off compute time against hardware utilization, and neural architecture search (NAS) is used to improve accuracy by optimizing the CNN architectures. The main contributions of this thesis are as follows:

(1) A pipelined CNN FPGA structure with flexible parallelism. The thesis explores pipelined CNN FPGA accelerator structures as a way to reduce CNN latency. Different pipelined hardware structures are implemented depending on the complexity of the target CNN.

(a) First, a pipelined structure is implemented for a network with a convolutional layer and a pooling layer. The pooling layer is processed at the same time as the convolutional layer, and the hardware can handle different CNN parameters. This implementation reduces latency by 37.5% compared with designs that do not use pipelining. An ASIC is also designed for this implementation.

(b) For CNNs of low complexity, a hardware accelerator structure with fully pipelined layers and fully parallel channels is implemented. This structure achieves lower latency than state-of-the-art designs. Timing optimization based on the timing of the High-Definition Multimedia Interface (HDMI), together with quantization methods, is used to further reduce both latency and hardware utilization. The Open Coherent Accelerator Processor Interface (OpenCAPI) is used to ensure high-bandwidth communication with the host processor, and multiple CNN instances are implemented in the system to increase throughput. The LeNet-5 implemented in this thesis achieves a latency of 9.32 μs and a throughput of 1.11 TOPS at a 250 MHz clock, with an accuracy of 98.8% on the MNIST dataset. Compared with state-of-the-art LeNet-5 implementations, the latency is reduced by 18.1% and the throughput is increased by 2.5 times.

(c) For CNNs of higher complexity, a hardware accelerator structure with fully pipelined layers and semi-parallel channels is implemented. In this structure, the degree of parallelism can be chosen according to the hardware resources available on the target FPGA, and a NAS strategy is used to increase the accuracy of the CNN. The binary ResNet18 with NAS implemented in this design achieves a Top-1 accuracy of 60.5% on ImageNet. The latency of this implementation varies from 1.12 to 6.33 ms, while the throughput varies from 4.56 to 0.71 TOPS, at a 200 MHz clock. Compared with state-of-the-art ResNet18 implementations, the latency is reduced by a factor of 8 and the throughput is increased by 1.9 times.

(2) Applying NAS to the automated CNN hardware design process. The thesis explores automating the design flow of CNN FPGA implementations using FINN, a compiler for CNN FPGA accelerators. First, LeNet-5 models are implemented with the FINN compiler. The thesis then integrates the NAS strategy with FINN so that the search results can be implemented automatically. With this strategy, the implementation not only achieves higher accuracy but also reaches a better trade-off between accuracy and hardware utilization. Compared with the baseline ResNet18 models, this strategy increases Top-1 accuracy by up to 3.0% and Top-5 accuracy by up to 2.2%. For ResNet34 models, the increases are up to 3.1% and 2.1% for Top-1 and Top-5 accuracy, respectively. The best throughput of this implementation is 324.5 fps. The automated design flow was applied in several practical application domains and achieved high accuracy in each: 99.3% for satellite image detection, 95.3% for Covid-19 X-ray image detection, 94.7% for plant disease detection, and 96.1% for flower type detection.

(3) Analyzing the advantages and disadvantages of manual and automated design processes. The thesis compares the CNN implementations produced by the manual design process with those produced by the automated design process described in the earlier parts of the thesis. The comparison covers design time, design flexibility, latency, throughput, accuracy, and hardware utilization, and the benefits and drawbacks of each process are analyzed. This analysis can serve as a reference for choosing a design process in different situations.

In conclusion, this thesis explores a pipelined CNN FPGA structure with flexible parallelism and integrates the NAS strategy with automated implementation, with the design goals of low latency, high accuracy, and low hardware utilization. It also analyzes the pros and cons of the different design processes and can serve as a reference for future CNN FPGA designs.
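The relationship between channel parallelism, pipelining, and latency described in the abstract can be illustrated with a toy cycle-count model. The sketch below is not the architecture from the thesis: it assumes that each parallel channel unit performs one multiply-accumulate (MAC) per clock cycle, and all function names and layer dimensions are hypothetical.

```python
from math import ceil


def conv_layer_cycles(out_channels, in_channels, out_h, out_w, kernel, parallel_channels):
    """Toy cycle count for one convolutional layer.

    Assumption: `parallel_channels` output channels are computed side by side,
    and each of them needs one cycle per MAC.
    """
    macs_per_output = in_channels * kernel * kernel
    passes = ceil(out_channels / parallel_channels)  # channel groups processed in turn
    return passes * out_h * out_w * macs_per_output


def sequential_cycles(stage_cycles):
    """Without pipelining, each stage waits for the previous one to finish."""
    return sum(stage_cycles)


def pipelined_interval(stage_cycles):
    """In a full pipeline, the slowest stage sets the steady-state interval."""
    return max(stage_cycles)


def cycles_to_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_mhz


if __name__ == "__main__":
    # Hypothetical layer: 16 output channels, 6 input channels, 10x10 output, 5x5 kernel.
    for p in (4, 16):
        cycles = conv_layer_cycles(16, 6, 10, 10, 5, parallel_channels=p)
        print(p, cycles, cycles_to_us(cycles, clock_mhz=250))
```

In this model, raising `parallel_channels` from 4 to 16 cuts the cycle count of the example layer by a factor of four, at the cost of four times as many MAC units, which is the kind of compute-time versus hardware-utilization trade-off the abstract refers to.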

Keywords/Search Tags:

CNN, FPGA, Pipeline layer and parallel channel, NAS, Automated CNN hardware design


Related items

1	Parallel Architecture Design And FPGA Verification Of Hardware Adaboost Algorithm
2	Research And Implementation Of The FPGA-based High-speed Routing Lookup Algorithm
3	Design And Implementation Of Hardware Acceleration Architecture Of Physical Layer Protocol Stack Based On FPGA
4	Based On FPGA To Design And Implement The Algorithm Of VGG-16 Neural Network
5	The Parallel Design And Optimization Of H.264 Intra Prediction Based On FPGA
6	The Accelerated Implementation Of SIFT Image Matching Algorithm On FPGA
7	Design Of Convolutional Neural Network Acceleration System And FPGA Verification
8	FPGA-Based Multi-Channel Ultra-High Definition Video Real-Time Processing System Design
9	The Research And Implementation Of Deep Learning Heterogeneous Computing Platform Based On CPU And Multiple FPGA Architecture
10	Infrared Moving Target Identification And Tracking System (DSP + FPGA) Hardware Design And Realization