Autonomous vehicles must interact with other traffic participants in complex traffic environments, and sensing and understanding the locations and states of static and dynamic traffic elements is the key to ensuring their safe driving. In combination with the research content of the National Natural Science Foundation of China project (52072054), this thesis studies environmental cognition and scene understanding methods based on multi-sensor information fusion in complex autonomous driving environments. The main research work is as follows.

(1) To address the dependence of scene understanding on accurate object localization, a 3D object detection algorithm based on a cascaded YOLOv7 and the Frustum PointNet structure is proposed, which fuses RGB images and point cloud data of the scenes surrounding the autonomous vehicle to obtain high-precision object positions. First, a frustum estimation model based on YOLOv7 is constructed to lift each RGB image RoI into 3D space. Then the object points and background points within the frustum are segmented by PointNet++. Finally, an amodal 3D box estimation network outputs the 3D information of each object, capturing the spatial relationships between objects. The advantages of the model are verified by comparative and ablation experiments on the KITTI dataset. Based on the 3D detection results, a 3D multi-object tracking algorithm is established: real-time multi-object tracking is achieved by combining a 3D Kalman filter with the Hungarian algorithm, which mitigates the ID switching that the detection system suffers across consecutive frames containing multiple objects. In addition, the validity of the tracking model is verified with 3D MOT evaluation metrics.

(2) To address the lack of semantic information in scene understanding, a scene semantic completion model is built to infer dense geometric and semantic information of driving scenes from images. First, EfficientNet-B7 is used to extract image features. The voxels are then back-projected onto the multi-scale 2D feature maps through Feature Line of Sight Projection (FLOSP) for feature sampling, and 3D features are extracted by a 3D UNet. A 3D context prior layer is introduced between the 3D UNet encoder and decoder to reconcile discrepancies in spatial semantic information, and the scene semantic completion results are output through Atrous Spatial Pyramid Pooling (ASPP) and softmax layers. Finally, scene understanding is accomplished by integrating the results of the 3D object detection, tracking, and scene semantic completion models.

(3) To verify the validity of the scene understanding model, data from multiple scenes are collected and datasets are built with ASEva, RSView, and Wireshark. First, the camera and lidar sensors are calibrated online through ASEva. RSView is used for real-time visualization of point clouds while the scene datasets are collected on the road, and Wireshark captures the data transmitted in the background in time. The offline files storing the point cloud information and video are then extracted, and the effectiveness of the scene understanding model is verified by experiments, demonstrating the applicability and generalization of the model.
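As a minimal illustration of the frustum estimation step in (1), the sketch below selects the lidar points whose image projections fall inside a 2D RoI produced by the detector; this point set is what would be passed on to PointNet++ segmentation. It assumes a pinhole camera model with a known intrinsic matrix K and a lidar-to-camera extrinsic transform, and is a simplified sketch rather than the implementation used in the thesis.

```python
import numpy as np

def frustum_points(points_lidar, box_2d, K, T_cam_from_lidar):
    """Select lidar points whose image projections fall inside a 2D RoI (the frustum).

    points_lidar:      (N, 3) xyz points in the lidar frame
    box_2d:            (x_min, y_min, x_max, y_max) RoI from the 2D detector
    K:                 (3, 3) camera intrinsic matrix
    T_cam_from_lidar:  (4, 4) homogeneous lidar-to-camera transform
    """
    # Transform the points into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam = pts_cam[in_front]
    kept = points_lidar[in_front]

    # Project to pixel coordinates.
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Frustum mask: the projection lies inside the 2D RoI.
    x_min, y_min, x_max, y_max = box_2d
    inside = ((uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
              (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max))
    return kept[inside]
```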
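The tracking stage in (1) combines a per-object 3D Kalman filter with Hungarian assignment to keep object IDs stable across frames. The sketch below shows only the frame-to-frame association step, using a hypothetical centroid-distance cost and gating threshold; a complete tracker would add the Kalman predict/update cycle and track birth/death management.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centroids, det_centroids, max_dist=2.0):
    """Match existing tracks to new 3D detections with the Hungarian algorithm.

    track_centroids: (T, 3) predicted box centers from each track's Kalman filter
    det_centroids:   (D, 3) box centers from the current frame's 3D detector
    Returns (matches, unmatched_tracks, unmatched_dets).
    """
    if len(track_centroids) == 0 or len(det_centroids) == 0:
        return [], list(range(len(track_centroids))), list(range(len(det_centroids)))

    # Pairwise Euclidean distance as the assignment cost.
    cost = np.linalg.norm(
        track_centroids[:, None, :] - det_centroids[None, :, :], axis=-1
    )
    rows, cols = linear_sum_assignment(cost)

    matches, unmatched_tracks, unmatched_dets = [], [], []
    for t in range(len(track_centroids)):
        if t not in rows:
            unmatched_tracks.append(t)
    for d in range(len(det_centroids)):
        if d not in cols:
            unmatched_dets.append(d)
    for r, c in zip(rows, cols):
        if cost[r, c] > max_dist:        # reject implausible pairings
            unmatched_tracks.append(r)
            unmatched_dets.append(c)
        else:
            matches.append((r, c))       # track r keeps its ID, updated by detection c
    return matches, unmatched_tracks, unmatched_dets
```

The distance gate prevents a far-away detection from stealing an existing ID, while unmatched detections can spawn new tracks and unmatched tracks can be aged out.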
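For the FLOSP step in (2), each voxel center is projected onto the image plane and the corresponding 2D feature vector is sampled. Below is a minimal single-scale PyTorch sketch under assumed known intrinsics and camera-frame voxel centers; the thesis model samples at multiple 2D feature scales before the 3D UNet.

```python
import torch
import torch.nn.functional as F

def flosp_sample(feat_2d, voxel_centers_cam, K, image_size):
    """Sample 2D features at the projections of 3D voxel centers (FLOSP-style).

    feat_2d:           (1, C, Hf, Wf) 2D feature map from the image backbone
    voxel_centers_cam: (V, 3) voxel centers in the camera frame
    K:                 (3, 3) camera intrinsics
    image_size:        (H, W) of the original image, used to normalize pixels
    Returns (V, C) features; voxels projecting outside the image receive zeros.
    Voxels behind the camera should additionally be masked in a full implementation.
    """
    H, W = image_size
    uvw = voxel_centers_cam @ K.T                      # project to the image plane
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)                      # (1, 1, V, 2)

    sampled = F.grid_sample(feat_2d, grid, align_corners=True,
                            padding_mode="zeros")      # (1, C, 1, V)
    return sampled[0, :, 0].T                          # (V, C)
```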