As the application scope of artificial intelligence technology continues to expand, the scenarios that algorithms must handle become more diverse. Under variable illumination, single-modal object detection suffers from unstable detection performance, and joint detection over multiple modalities is a common technical solution to this problem. However, the improved performance of multi-modal detection comes with a dramatic increase in computation and parameters, so the cost of joint multi-modal detection is high, and the overly large and complex model structure is not well suited to deployment on embedded platforms such as FPGAs. To address these issues, this paper studies a small-scale joint infrared and visible object detection model and explores the design of FPGA accelerators suitable for multi-modal models. The main work is as follows:

(1) A local adaptive illumination-driven input-level fusion module (LAIIFusion) is proposed. This module focuses on input-level fusion and selects easily extractable illumination information as the basis for fusion, reducing the generation of redundant information while ensuring that infrared and visible information are properly combined. To solve the problem of incomplete perception of scene illumination, a local illumination perception module supervised by pixel-statistics information is proposed; it fully perceives the illumination differences across regions of the image and provides a more suitable reference for fusion. Meanwhile, an offset estimation module built on a bottleneck structure is designed to predict the positional offset of objects between modalities and achieve fast alignment of image pairs. Experimental results show that LAIIFusion combined with the single-modal object detection model YOLOv5L achieves an MR-2 of 10.44 and an average detection speed of 31 FPS on the KAIST dataset. Compared with MBNet, a representative joint infrared and visible object detection network, the MR-2 is 2.31 higher, while the average detection speed is 21 FPS faster. When LAIIFusion is combined with other single-modal object detection models, the nighttime MR-2 decreases by more than 50%. The designed input-level fusion module thus converts a single-modal detector into a multi-modal one at low cost, improving detection performance while preserving real-time operation.

(2) An accelerator for joint infrared and visible object detection based on the ZYNQ architecture is designed. The model consists of YOLOv5S and LAIIFusion. On the one hand, the overall model scale is reduced, including the amount of computation, the number of parameters, and the size of intermediate feature maps, to strike a balance between accuracy and scale. On the other hand, model parameters and intermediate feature maps are quantized to reduce redundancy and the dependence on storage resources. Meanwhile, by adding multiple input and output buffer blocks and applying data slicing, multi-channel parallel computation is achieved on the PL side. Moreover, the convolution layers are fused with the batch normalization layers to maximize on-chip resource utilization and increase computational parallelism. To handle the multiple branches in the model structure, a multi-branch execution strategy based on data reuse is designed, which effectively reduces the number of data reads for large feature maps.
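As a rough illustration of the convolution and batch-normalization fusion mentioned above, the sketch below folds the BN statistics into the convolution weights, which is the standard way a Conv+BN pair is collapsed into a single layer before deployment; the `fuse_conv_bn` helper, layer sizes, and tolerance are illustrative assumptions rather than details taken from the thesis.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm2d statistics into the preceding Conv2d so that a
    single convolution replaces the Conv + BN pair at inference time."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      kernel_size=conv.kernel_size, stride=conv.stride,
                      padding=conv.padding, dilation=conv.dilation,
                      groups=conv.groups, bias=True)

    # Per-output-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)

    # W' = W * scale, broadcast over the (out_ch, in_ch/groups, kH, kW) weight
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))

    # b' = (b - running_mean) * scale + beta
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Example with illustrative shapes: the fused layer matches the Conv+BN pair.
conv = nn.Conv2d(32, 64, 3, padding=1, bias=False).eval()
bn = nn.BatchNorm2d(64).eval()
fused = fuse_conv_bn(conv, bn)
x = torch.randn(1, 32, 80, 80)
assert torch.allclose(bn(conv(x)), fused(x), atol=1e-5)
```

After folding, only the fused weights and biases need to be stored and streamed on-chip, which is what frees resources for the higher computational parallelism described above.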
Experimental results show that the designed infrared and visible object detection accelerator achieves an energy efficiency ratio of 4.53 GOP/s/W with a power consumption of only 5.55 W on the AX7Z100 platform. Its energy efficiency ratio is 6.25 times that of an Intel Core i5-9400F platform, at only 8.56% of its power consumption. Although a gap remains between the energy efficiency ratio of the designed accelerator and that of GPU platforms, the small size and customizable hardware of FPGAs keep them competitive in applications such as assisted driving and unmanned aerial vehicles.
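For context, the figures quoted above relate to each other as in the following sketch; the throughput and CPU-side numbers are derived from the stated ratios rather than reported directly.

```python
# Illustrative check of how the reported figures relate (derived values are
# computed from the quoted ratios, not stated in the text).
fpga_eff = 4.53     # accelerator energy efficiency, GOP/s/W (reported)
fpga_power = 5.55   # accelerator power on the AX7Z100, W (reported)

fpga_throughput = fpga_eff * fpga_power   # ≈ 25.1 GOP/s (derived)
cpu_eff = fpga_eff / 6.25                 # implied i5-9400F efficiency ≈ 0.72 GOP/s/W
cpu_power = fpga_power / 0.0856           # implied i5-9400F power ≈ 64.8 W

print(f"FPGA throughput ≈ {fpga_throughput:.1f} GOP/s")
print(f"Implied CPU baseline: {cpu_eff:.2f} GOP/s/W at {cpu_power:.1f} W")
```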