
Research On Monocular Depth Estimation Method Based On Unsupervised Deep Learning

Posted on: 2024-07-14    Degree: Master    Type: Thesis
Country: China    Candidate: H Y Hu    Full Text: PDF
GTID: 2568307097962979    Subject: Electronic information
Abstract/Summary:
Depth estimation is one of the main research topics in computer vision. Depth information is crucial for scene understanding and has many applications in autonomous driving, 3D reconstruction, and virtual reality. Traditional methods mainly perform stereo matching based on the geometric relationship between binocular or multi-view images and infer depth by triangulation, but they are limited to specific scenes and computationally expensive. Monocular cameras are applicable to a much wider range of real-life scenes, so depth estimation from monocular images is closer to practical use. However, monocular depth estimation is an ill-posed problem: countless real scenes can project to the same RGB image, and reliable constraints to narrow down these possibilities are lacking. With the development of deep learning, researchers have used convolutional neural networks to achieve pixel-level depth estimation from monocular images. Depending on the training scheme, these methods can be divided into supervised and unsupervised learning. Supervised learning uses depth ground truth labeled in advance as the supervisory signal and optimizes the model so that the predicted depth approaches the labeled values. However, depth ground truth is costly and difficult to acquire, and pixel-by-pixel labeling requires substantial human effort, so unsupervised monocular depth estimation has attracted increasing attention. In this paper, we propose unsupervised monocular depth estimation methods with the goal of improving estimation accuracy. The main work covers the following two aspects:

1) An unsupervised monocular depth estimation method based on an attention mechanism and a depth-pose consistency loss is proposed to address the loss of local detail, the scale ambiguity inherent in monocular systems, and the decrease in accuracy caused by ill-posed regions in complex scenes. In this method, a pyramid channel attention module is designed and embedded in the feature-extraction stage to overcome the limited receptive field and cross-channel independence of convolutional neural networks. It applies convolution kernels of different scales in parallel to capture multi-scale information and assigns larger weights to the more informative channels of the feature map, thereby recovering more local details. Meanwhile, a depth-pose consistency loss is proposed: based on the geometric consistency between depth and pose, it serves as a supervisory signal that constrains the scale across samples and effectively eliminates scale ambiguity. In addition, the occlusion mask derived from the depth-pose consistency loss filters out dynamic objects and outliers, reducing the impact of ill-posed regions and further improving performance. Extensive experiments on indoor and outdoor datasets, evaluated with public metrics against other advanced methods, show that the proposed method achieves state-of-the-art performance on all benchmarks.

2) A self-supervised monocular depth estimation method with joint semantic segmentation is proposed, exploiting the mutually beneficial relationship between depth estimation and semantic segmentation to further improve depth estimation with semantic information. In this method, a shared encoder for semantic segmentation and depth estimation provides semantic guidance. To further improve the encoder's performance across tasks, a multi-task feature-extraction module is designed with two operations: grouped feature mapping and multi-scale feature attention fusion. Specifically, it first groups the feature maps, then aggregates multi-scale feature context after local and global feature refinement in the attention-fusion module. Meanwhile, a cross-task interaction module is proposed to enable cross-domain information exchange; since the module is a unidirectional data flow in which reference features refine target features, two cross-task interaction modules are embedded in the decoder to achieve bidirectional enhancement. This further improves monocular depth estimation, especially in weakly textured regions and at object boundaries, where the supervision from photometric consistency is limited. Finally, a comprehensive evaluation on the KITTI dataset demonstrates that the proposed method yields significant accuracy improvements and outperforms other methods.
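The idea of a depth consistency term with a derived mask can be sketched in a few lines. This is a minimal illustration only, in the spirit of geometry-consistency losses used in unsupervised depth estimation: it assumes the loss normalizes the per-pixel difference between the target depth and the warped source depth, and that the mask is one minus that difference. The function name, exact normalization, and mask form are assumptions, not the thesis's actual formulation.

```python
import numpy as np

def depth_pose_consistency(d_target, d_source_warped, eps=1e-7):
    """Illustrative depth-consistency term between a predicted target depth
    map and the depth of an adjacent frame warped into the target view.

    Returns a scalar loss and a per-pixel weighting mask in [0, 1]."""
    # Normalized absolute depth difference: bounded, and invariant to a
    # common scale factor applied to both depth maps.
    diff = np.abs(d_target - d_source_warped) / (d_target + d_source_warped + eps)
    loss = diff.mean()   # penalizes scale/geometry drift between adjacent frames
    mask = 1.0 - diff    # down-weights pixels that violate consistency,
                         # e.g. dynamic objects and occlusions
    return loss, mask
```

In use, the mask would multiply the photometric loss so that inconsistent pixels (moving objects, occluded regions) contribute less to training: identical depth maps yield zero loss and a mask of all ones, while a locally inconsistent pixel receives a small mask weight.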
Keywords/Search Tags:Unsupervised deep learning, monocular depth estimation, attentional mechanisms, semantic segmentation