In recent years, with the steady growth in vehicle ownership, road systems have faced greater challenges in traffic efficiency and traffic safety. The rise of artificial intelligence has opened a new direction for vehicle driving, and autonomous driving technology, as a safe and efficient means of alleviating traffic congestion and reducing traffic accidents, offers a promising way to address these problems. Among the many functional modules of autonomous driving, intelligent vehicle decision-making bridges environmental perception and planning and control, serving as the "brain" of the vehicle. End-to-end intelligent vehicle decision-making learns the decision signal directly from image information. To address two shortcomings of existing end-to-end decision-making methods, namely limited exploitation of the forward-view image information and insufficient feature-extraction capability of the decision model, this thesis studies an end-to-end autonomous driving decision-making method that combines spatio-temporal features and visual attention, taking the original image and depth image as input and the steering wheel angle and speed value as output. The main research components are as follows:

(1) Three longitudinal-and-lateral end-to-end autonomous driving decision models are constructed based on deep convolutional neural networks. Because the conventional PilotNet convolutional model has a simple, shallow network design with insufficient feature-extraction capability, three end-to-end decision models, VGG16-DNet, InceptionV3-DNet, and ResNet50-DNet, are built via transfer learning from the convolutional layers of VGG16, Inception V3, and ResNet50, and two metrics, MAE and TRE, are selected to evaluate the driving decision models offline. Two publicly available real-world driving datasets, Udacity and Comma2k19, are used to conduct research
experiments, and the results show that deep convolution extracts features of complex road scenes more effectively, with the ResNet50-DNet end-to-end decision model built on residual networks performing best. In addition, shallow convolutional features are visualized in order to understand the decision model's learning process intuitively.

(2) Three methods for fusing the original RGB images with depth images are investigated. Because the end-to-end autonomous driving decision model relies entirely on the RGB image in front of the vehicle, and in order to fully exploit the road-environment information contained in the original RGB image while accounting for the relative distance of each point in the forward road scene, a deep-learning-based monocular depth estimation algorithm is used to extract an additional depth image from each original image. Taking the original RGB image and the depth image as model inputs, fusion of the two at the pixel, feature, and decision levels is studied. The experimental results show that the end-to-end decision model taking both the original RGB image and the depth image as input is more accurate than one taking either image alone, and that feature-level fusion outperforms pixel-level and decision-level fusion.

(3) An end-to-end autonomous driving decision model fusing spatio-temporal features and visual attention is proposed. Access to past environment and vehicle-state information helps the end-to-end decision model make better decisions, so two end-to-end decision frameworks based on spatio-temporal feature fusion, single-stream and dual-stream, are constructed by extracting spatial and temporal features with convolutional networks and LSTM
networks, respectively. In addition, visual attention helps the end-to-end decision model capture important information about the road environment ahead of the vehicle. Two visual attention modules, SENet and Non-Local Net, are introduced via transfer learning, and single-stream and dual-stream end-to-end decision frameworks fusing spatio-temporal features with visual attention are likewise constructed. Experimental results show that both extracting temporal features and introducing visual attention reduce driving-decision error and effectively improve the performance of the end-to-end decision model, with the dual-stream ResNet50-SENet-LSTM model performing best. Simulation tests and online evaluation in the CARLA simulator show that the end-to-end decision model is able to accomplish autonomous driving, and that the model fusing depth information, temporal information, and visual attention exhibits better driving performance, with a significantly reduced number of dangerous decisions.
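To illustrate the single-stream architecture described above (per-frame convolutional encoding with channel attention, followed by an LSTM over the frame sequence and a two-value regression head), the following is a minimal PyTorch sketch. It is not the thesis's actual model: the real models use pretrained ResNet50 backbones via transfer learning, and all layer sizes, names, and the tiny encoder here are illustrative placeholders.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: reweight feature maps
    by channel importance learned from globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); squeeze to per-channel means, excite to weights
        w = x.mean(dim=(2, 3))                       # (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        return x * w

class DrivingDecisionNet(nn.Module):
    """Toy single-stream spatio-temporal decision model: a small CNN
    with an SE block encodes each frame, an LSTM fuses the sequence,
    and a linear head regresses (steering angle, speed)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            SEBlock(32),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # per-frame (N, 32)
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)             # steering, speed

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (N, T, 3, H, W) sequence of forward-view frames
        n, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(n, t, -1)
        out, _ = self.lstm(feats)                    # temporal fusion
        return self.head(out[:, -1])                 # last-step decision
```

A clip of T frames yields one decision vector; for example, `DrivingDecisionNet()(torch.randn(2, 4, 3, 64, 64))` returns a tensor of shape `(2, 2)`. Swapping the toy encoder for a pretrained ResNet50 trunk would recover the transfer-learning setup described in the abstract.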