| Stereo vision can reconstruct three-dimensional environments and obtains depth information flexibly and at low cost, so it is widely used in autonomous driving, virtual reality, robot navigation, and non-contact measurement. Stereo matching is the critical step in binocular stereo vision; its purpose is to compute the disparity of matching pixels in the rectified left and right images. In recent years, with the development of artificial intelligence, deep-learning-based stereo matching algorithms have proven more robust than traditional methods, but several problems remain unsolved in complex real-world scenes. This thesis focuses on three of them. First, existing stereo matching methods struggle in ill-posed regions such as occlusions, illumination changes, and weak textures, which limits overall disparity accuracy. Second, algorithms based on a 3D cost volume assume that the disparity probability distribution is unimodal and obtain the final output as the sum of the candidate disparities weighted by their predicted probabilities. In real scenes, however, the predicted distributions are often multimodal, which produces disparity outliers. Third, the widely used disparity-based loss emphasizes nearby regions with large disparities, so disparity estimation for distant regions is weak, which degrades overall disparity accuracy and downstream applications such as object detection. The main work and contributions are as follows: (1) To address matching ambiguity in ill-posed regions, we introduce high-level panoptic parsing to guide disparity estimation. We combine semantic and instance segmentation with the stereo matching branch and propose confidence, disparity residual, and loss modules that optimize disparities from the perspective of panoptic features, categories, and geometric 
structures. Exploiting the consistency of semantic and instance features between matching left and right pixels, the confidence module computes the correlation of semantic and instance features at each candidate disparity and uses it to adjust the probability distribution in the cost volume. In the disparity residual module, the disparity map is split into multiple channels according to semantic and instance categories, and semantic- and instance-related residuals are then obtained with depthwise separable convolutions. Finally, exploiting the geometric similarity between panoptic and disparity maps, the loss module generates panoptic-parsing-guided boundary and smoothness losses to supervise the network. The whole model is tested and analyzed on multiple public datasets; on the Virtual KITTI dataset in particular, it significantly improves matching quality in occluded, weakly textured, and boundary regions, verifying the effectiveness of the introduced panoptic information. (2) To address multimodal disparity probability distributions in the prediction stage, this thesis proposes a stereo matching network based on an unbiased unimodal cost volume. We first analyze the different types of disparity probability distributions, among which multimodal and deviated unimodal distributions can cause large errors. To solve this problem, we design the network structure from three perspectives. First, because sufficient features help the network learn a unimodal distribution, we supplement the general 3D convolutions with 2D convolutions along the disparity dimension to obtain global disparity features. Second, we propose a self-supervised loss that encourages multimodal distributions to become unimodal. Finally, an iterative refinement structure predicts probabilities and disparity offsets to fine-tune deviated unimodal distributions. Experiments show that an unbiased unimodal distribution effectively improves disparity estimation. In supervised training mode, the 
entire network reduces outlier rates by 23% and 19% on the Scene Flow and KITTI Stereo 2015 datasets, respectively. In unsupervised training mode, it also outperforms other unsupervised algorithms on most metrics. (3) To address the overemphasis of the traditional disparity loss on nearby pixels, we propose a normalized disparity loss that can be embedded in most stereo matching networks to improve disparity accuracy in faraway regions as well as overall matching performance. We observe that the disparity loss grows as the ground-truth disparity increases, so faraway regions, which have small disparities and therefore small losses, cannot be trained effectively. We introduce a cost function that models this trend and normalize the disparity loss by dividing it by this function, so that the network is trained more evenly across disparities. In addition, because the cost function is obtained statistically and is subject to error, we limit its range to avoid potentially abnormal values. We conducted extensive experiments on multiple public stereo matching datasets with different baseline models. The results show that this minor modification to the loss function improves disparity estimation in distant regions beyond 10 m on the Scene Flow, DrivingStereo, and KITTI Stereo datasets, while the error rate in nearby regions is unchanged or increases only slightly. Moreover, we apply our method to the 3D object detection task, where it improves the precision of distant object detection, indicating substantial value for downstream tasks of stereo matching. |
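
The second problem named in the abstract, where the output disparity is the probability-weighted sum of candidate disparities, can be illustrated with a small numerical sketch. The code below is not the thesis's network. It only shows, under assumed toy distributions, why a multimodal distribution can yield an outlier. The helper `gaussian_bumps`, the 64 candidate disparities, and the mode positions 20 and 50 are hypothetical choices made for illustration.

```python
import math


def soft_argmin_disparity(probabilities):
    """Expected disparity: sum over candidates d of d * p(d)."""
    return sum(d * p for d, p in enumerate(probabilities))


def gaussian_bumps(num_candidates, centers, sigma=1.0):
    """Hypothetical helper: normalized mixture of equal-weight Gaussian bumps.

    Stands in for the per-pixel probability distribution that a cost-volume
    network would predict over candidate disparities 0..num_candidates-1.
    """
    raw = [
        sum(math.exp(-0.5 * ((d - c) / sigma) ** 2) for c in centers)
        for d in range(num_candidates)
    ]
    total = sum(raw)
    return [r / total for r in raw]


# Assumed toy distributions over 64 candidate disparities.
unimodal = gaussian_bumps(64, centers=[20])     # single peak at d = 20
bimodal = gaussian_bumps(64, centers=[20, 50])  # two equal peaks at 20 and 50

est_unimodal = soft_argmin_disparity(unimodal)
est_bimodal = soft_argmin_disparity(bimodal)

# The unimodal estimate lands on its peak (close to 20).
print(f"unimodal estimate: {est_unimodal:.2f}")
# The bimodal estimate lands between the peaks (close to 35),
# far from both plausible disparities, i.e. an outlier.
print(f"bimodal estimate:  {est_bimodal:.2f}")
```

In this toy case the weighted sum matches the peak when the distribution is unimodal, but falls midway between the two peaks when it is bimodal, which is the kind of outlier the abstract attributes to multimodal distributions.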