| Perception tasks that rely on a single sensor have inherent limitations. To improve perception accuracy, many works have adopted multi-sensor fusion, and with the rapid development of 3D scene perception technology, multimodal fusion has been widely applied to 3D object detection. However, current multi-sensor fusion methods have two problems. First, they use multi-sensor information inefficiently, which makes it difficult to detect objects that are imaged at low resolution or partially occluded in complex scenes. Second, because multimodal fusion depends on data from several sensors, the algorithms are less robust and are easily affected by factors such as sensor failure and data loss. Current multi-sensor fusion methods therefore need further improvement in both accuracy and robustness. In deep learning, attention mechanisms improve the ability of deep networks to represent, analyze, and understand data by adaptively selecting and weighting different features. This thesis investigates how attention mechanisms affect detection results in object detection that fuses point cloud and image information, and verifies the effectiveness of the proposed algorithm. The work consists of two parts: 1) Research on single-modal detection algorithms based on attention mechanisms. In this part, single-modal data is used for object detection. The network adopts an encoder-decoder structure, and a local-global attention module is inserted between the encoder and the decoder to obtain richer global context information. The local-global attention module consists of a local module, a global attention module, and a skip connection. Experimental results show that the proposed detection algorithm effectively improves detection performance in both the image-only and the point-cloud-only settings. 2) Research on multimodal detection algorithms based on attention mechanisms. In this part, the nuScenes dataset, which contains data from multiple sensors, is used. Building on the first part, the method first applies a target query initialization mechanism that uses the extracted image features as guidance to obtain the queries. A cross-attention mechanism is then added to the encoder-decoder structure to fuse image features and point cloud features. Finally, two decoder layers predict candidate boxes and output the object detection results. The cross-attention module separately computes fused features for the point cloud features and the image-feature queries. In the detection head, a feedforward neural network and supervision are added after each decoder layer, and the predicted candidate boxes are used to constrain the cross-attention. The cross-attention mechanism can model the relationships between feature maps from different sensors and thus make full use of the semantic information in those feature maps. Experimental results show that, in the same scene, the mAP of multimodal object detection is 4.1% higher than that of single-modal detection, and that across different scenarios the multimodal fusion method proposed in this thesis shows more stable detection performance. |
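
The cross-attention fusion step described in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, followed by a residual connection. The function names, the toy vectors, and the choice of image features as keys and values are hypothetical and are not taken from the thesis, whose actual module is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: list of d-dimensional vectors, e.g. object queries.
    keys, values: lists of d-dimensional vectors from the other modality.
    Returns (fused, weights), where
    fused[i] = queries[i] + sum_j weights[i][j] * values[j].
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for q in queries:
        # Similarity between this query and every key of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Attention-weighted sum of the value vectors, one channel at a time.
        attended = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        # Residual connection (assumes queries and values share a dimension).
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        all_weights.append(weights)
    return fused, all_weights


# Toy inputs with hypothetical values: two object queries, three image features.
object_queries = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

fused, weights = cross_attention(object_queries, image_features, image_features)
```

Each row of `weights` sums to one, so every fused query is the original query plus a convex combination of the other modality's features. A practical detector would add learned query, key, and value projections and multiple heads, which this sketch leaves out for brevity.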