Semantic segmentation is one of the key technologies for the semantic understanding of road scenes by intelligent vehicles, and it is also an important research topic in computer vision and artificial intelligence at large. Its goal is to perform comprehensive and fine-grained semantic recognition of scenes. Early semantic segmentation methods were often based on hand-crafted features and shallow machine learning models, making it difficult for them to adapt to complex environments. In recent years, deep learning has achieved major breakthroughs in many fields of artificial intelligence, and deep-learning-based semantic segmentation has become an important research direction. This paper focuses on two difficult issues in deep-learning-based semantic segmentation: the fusion of heterogeneous and cross-modal data, and the modeling and learning of highly distorted images. The main research results are as follows:

(1) Aiming at the problem of modeling the interdependencies between the feature extraction of RGB and depth data, an RGB-D semantic segmentation method based on interactive fusion is proposed. First, a bottom-up interactive fusion network structure is proposed. The structure introduces an interaction stream to connect the RGB stream and the depth stream; the interaction stream not only aggregates features from the RGB stream and the depth stream, but also computes complementary information for the modality-specific data streams. Then, a residual fusion block is proposed to instantiate this network structure, yielding an RGB-D semantic segmentation model named RFBNet (a toy sketch of the fusion block follows item (2) below). Finally, experiments on the indoor dataset ScanNet and the outdoor dataset Cityscapes verify that RFBNet achieves state-of-the-art performance.

(2) Aiming at the problem of fusing cross-modal image and point cloud data, a three-dimensional scene semantic segmentation method based on superpoint pooling is proposed. To make full use of the visual information in images and the geometric information in point clouds, a two-dimensional semantic segmentation network extracts two-dimensional visual features from the image, and a three-dimensional semantic segmentation network extracts three-dimensional geometric features from the point cloud. To combine the cross-modal image visual features and point cloud geometric features, a joint learning method based on superpoint pooling is proposed: superpoints serve as intermediate representations that merge visual and geometric features, after which joint feature extraction is performed. By using superpoints to connect cross-modal features, this method avoids the excessive memory consumption and quantization errors that often occur in voxelization methods, and it can handle large-scale point cloud scenes. Experiments verify that the proposed method effectively exploits the complementary advantages of geometric and visual information and yields a significant performance improvement over early-fusion and late-fusion methods.
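To make the interactive fusion idea in (1) concrete, the following is a minimal PyTorch-style sketch of a residual fusion block. All module and variable names are illustrative assumptions, not the actual RFBNet implementation: the interaction stream aggregates the two modality-specific streams and returns complementary residuals to each.

```python
import torch
import torch.nn as nn

class ResidualFusionBlock(nn.Module):
    """Toy residual fusion block (names are illustrative, not from the thesis).
    The interaction stream aggregates the RGB and depth streams and feeds
    complementary residuals back to each modality-specific stream."""
    def __init__(self, channels):
        super().__init__()
        self.aggregate = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.to_rgb = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_depth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, rgb, depth, interaction):
        # Aggregate features from both modality-specific streams.
        fused = self.relu(self.aggregate(torch.cat([rgb, depth], dim=1)))
        # The bottom-up interaction stream accumulates the fused features.
        interaction = interaction + fused
        # Feed complementary information back to each stream as residuals.
        rgb = rgb + self.to_rgb(interaction)
        depth = depth + self.to_depth(interaction)
        return rgb, depth, interaction
```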
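Likewise, the superpoint pooling in (2) can be sketched as a scatter-style average over superpoint indices. The function below is a simplified illustration (all names, shapes, and feature sizes are assumptions), with the per-point 2D visual features taken as already projected onto the 3D points.

```python
import torch

def superpoint_pool(point_feats, sp_ids):
    """Average per-point features (N, C) into superpoint features (S, C);
    sp_ids (N,) assigns each point to a superpoint id in [0, S)."""
    num_sp = int(sp_ids.max().item()) + 1
    sums = torch.zeros(num_sp, point_feats.size(1)).index_add_(0, sp_ids, point_feats)
    counts = torch.zeros(num_sp).index_add_(0, sp_ids,
                                            torch.ones_like(sp_ids, dtype=torch.float))
    return sums / counts.clamp(min=1).unsqueeze(1)

# Visual and geometric features pooled over the same superpoints can then be
# concatenated as the input to joint feature extraction.
sp_ids = torch.randint(0, 50, (1000,))
visual_sp = superpoint_pool(torch.randn(1000, 64), sp_ids)  # from the 2D network
geom_sp = superpoint_pool(torch.randn(1000, 32), sp_ids)    # from the 3D network
joint_feats = torch.cat([visual_sp, geom_sp], dim=1)        # (50, 96)
```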
(3) Aiming at the large distortion and difficult modeling of surround-view images, a network based on restricted deformable convolution is proposed, which improves the network's ability to model geometric deformation by learning the position offsets of the sampling points of the convolution kernel (a hedged sketch appears after item (4) below). Aiming at the learning problem caused by the insufficient size of surround-view training sets, a training method based on multi-task learning is proposed: training on a small-scale real-world surround-view image dataset and a large-scale conventional image dataset with different category spaces is modeled as a multi-task learning process to improve the generalization performance of the model. To reduce the difference in image distortion between the two datasets, a zoom augmentation method is proposed to transform conventional images into surround-view-style images. To reduce the impact of the domain shift between the two datasets on training, a multi-task learning method based on AdaBN is proposed. To balance the loss weights between different tasks, a hybrid loss weighting method is proposed, further improving the generalization performance of the model. Experiments verify that the proposed method can effectively handle highly distorted surround-view images and finally realizes 360° semantic segmentation of road scenes.

(4) Based on the semantic segmentation results of surround-view images, a lane-level localization method based on semantic segmentation is proposed. First, a method for detecting road features (including road boundaries and road markings) based on pixel-level semantic segmentation is proposed. This method can distinguish real from non-real road boundaries, exclude dynamic targets from the localization process, and extract linear and non-linear road markings at the same time, helping to improve lateral and longitudinal localization accuracy. Considering the large difference in confidence between these two road features, a coarse-scale localization method and a fine-scale localization method are proposed to solve the problem of high-precision localization based on surround-view cameras. The coarse-scale localization method matches the road boundaries with the map to obtain a rough vehicle position and provides initialization information for fine-scale localization; the fine-scale localization method matches the road markings with the map to obtain the final vehicle position. Experiments show that the proposed method robustly detects different types of road boundaries and road markings and achieves centimeter-level localization accuracy in urban environments.
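As a hedged sketch of the restricted deformable convolution in (3): torchvision's deform_conv2d can be wrapped so that the kernel's sampling offsets are learned but restricted. Here the restriction is illustrated by keeping the central sampling point fixed, which is one plausible reading and not necessarily the thesis' exact formulation; the class name and initialization scheme are also assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class RestrictedDeformConv(nn.Module):
    """Illustrative 'restricted' deformable convolution: offsets are learned
    for the sampling points, but the central point stays on the regular grid
    (an assumed restriction for the sake of the sketch)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k, self.pad = k, k // 2
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Predict a 2D offset for each of the k*k sampling points.
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, k, padding=self.pad)
        nn.init.zeros_(self.offset_pred.weight)
        nn.init.zeros_(self.offset_pred.bias)

    def forward(self, x):
        offsets = self.offset_pred(x)  # (N, 2*k*k, H, W)
        # Restriction: zero out the offset of the central sampling point.
        center = (self.k * self.k) // 2
        mask = torch.ones_like(offsets)
        mask[:, 2 * center : 2 * center + 2] = 0
        return deform_conv2d(x, offsets * mask, self.weight, padding=self.pad)
```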