Image semantic segmentation refers to classifying each pixel in an image, assigning it a category label that associates it with a real-world object or concept. Disparity estimation, in turn, refers to finding the correspondence between all or part of the pixels in the two views of a binocular image and computing their offsets. Disparity can be readily converted into depth data (depth = focal length × baseline / disparity) to build a local 3D model of the scene. We refer to the combination of semantic segmentation and disparity estimation as 2.5D semantic segmentation. 2.5D semantic segmentation can be applied in many fields, such as scene perception for autonomous navigation robots, sensor positioning in augmented reality, and automatic analysis of radar images in defense and security. With the introduction of the Deep Convolutional Neural Network (DCNN), the performance of semantic segmentation and disparity estimation algorithms has achieved a major breakthrough. Building on the classification capability of the original DCNN, researchers have gradually extended it to image semantic segmentation and disparity estimation. In recent years, DCNN-based semantic segmentation networks have used the encoder-decoder structure as a basis to perform feature extraction and recover semantic detail. The disparity estimation task has undergone a transition from encoder-decoder structures, which focus on feature extraction and disparity detail restoration, to Siamese networks, which focus on feature extraction and feature matching. As network structures have become more complex and efficient, datasets have also been proposed in large numbers to advance related work.

We investigate and analyze existing scene understanding datasets and propose a large-scale binocular indoor scene understanding dataset, together with an end-to-end 2.5D semantic segmentation network based on it. We conduct in-depth research on three aspects: dataset generation, network model construction, and training strategies.

(1) Generation of a large-scale binocular scene understanding dataset: Based on currently published three-dimensional scene model datasets, we take the actual conditions of robot movement into account to guide a robot in performing global navigation within each scene. The navigation results are used to filter and sort scenes, and image rendering based on ray tracing is then performed in sequence. As a result, path planning is carried out for 5,414 scenes, and binocular RGB data from 222,778 poses in 312 scenes, together with binocular semantic segmentation and depth ground-truth labels, are rendered. Statistics show that the pixel distribution of the semantic segmentation labels is consistent with that of real datasets, ensuring the rationality of the dataset.

(2) Design and implementation of the 2.5D semantic segmentation network: Drawing on the design advantages of current state-of-the-art semantic segmentation and disparity estimation networks, we propose an end-to-end 2.5D semantic segmentation network that simultaneously outputs semantic segmentation and disparity estimation results. The network is divided into three parts: a feature extractor, a semantic segmentation branch, and a disparity estimation branch. The feature extractor is constructed from the residual units of Residual Networks (ResNet). The binocular input images are processed by the feature extractor in a Siamese configuration to obtain left- and right-image features at different resolutions. These binocular features are then processed by spatial pyramid pooling and a cost volume, respectively, to obtain multi-scale semantic segmentation features and disparity estimation features containing depth information. The two types of features are further processed by the semantic segmentation branch and the disparity estimation branch, which simultaneously output the semantic segmentation and disparity estimation results.

(3) Training strategies and experiments: Addressing the multi-task nature of the 2.5D semantic segmentation network, we determine the optimal training strategy and further introduce a multi-objective loss function to improve network performance. Through extensive ablation experiments, we confirm that the proposed dataset provides effective image information for training the semantic segmentation task, and we determine the optimal training strategy. Experiments also show that the proposed multi-objective loss function effectively improves the semantic segmentation metric, reaching 89.012%, while the disparity estimation error reaches 1.21 pixels. Furthermore, experiments show that the multi-objective loss function enables disparity estimation to be constrained by semantic supervision, suggesting directions for further research.
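To make the multi-objective loss function described above concrete, the following NumPy sketch combines a per-pixel cross-entropy term for segmentation with a smooth-L1 term for disparity regression under a weighted sum. The specific loss forms and the weights `w_seg` and `w_disp` are illustrative assumptions, not the exact formulation used in this work.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Segmentation loss; logits: (N, C) with pixels flattened, labels: (N,) class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def smooth_l1(pred, target):
    """Huber-style regression loss, commonly used for disparity estimation."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

def multi_objective_loss(seg_logits, seg_labels, disp_pred, disp_gt,
                         w_seg=1.0, w_disp=1.0):
    """Weighted sum of the two task losses; w_seg and w_disp are
    hypothetical hyperparameters chosen for illustration."""
    return (w_seg * softmax_cross_entropy(seg_logits, seg_labels)
            + w_disp * smooth_l1(disp_pred, disp_gt))
```

In such a joint objective, gradients from the semantic term also flow back through the shared feature extractor, which is one mechanism by which disparity estimation can be constrained by semantic supervision.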