In recent years, with the rapid development of artificial intelligence technology and the continuous upgrading of the intelligent industry, depth acquisition, as an important basis for 3D scene perception, has become an important research topic and plays a key role in autonomous driving, AR/VR, intelligent terminal imaging, and many other applications. Through depth acquisition algorithms, a computer can perceive the distance of objects in a scene, providing the basis for subsequent high-level 3D scene perception and understanding. In terms of broad categories, depth acquisition can be divided into active and passive acquisition. Active methods emit signals and use sensors to receive the signals reflected from targets to measure depth; they achieve high precision but are expensive. Passive methods estimate depth from images captured by a camera under external light sources; they are cheap and widely applicable, and although their current accuracy is not as high as that of active methods, they have attracted increasing attention.

This thesis focuses on two specific depth acquisition tasks, depth completion and monocular self-supervised depth estimation, and conducts in-depth and systematic research to improve the practical value of these algorithms in real scenarios. First, to address the problem that depth acquisition devices may only provide sparse depth, the depth completion task is introduced and a solution oriented toward practical applications is proposed. Then, to remove the dependence on accurate depth data for training and on expensive depth acquisition equipment, this thesis studies monocular self-supervised depth estimation. With the help of video information, the video self-supervised depth estimation task is introduced, and multiple sources of information are fused to obtain better depth estimation results. Specifically, this thesis proposes a temporal- and motion-enhanced video self-supervised depth estimation method that fuses single-frame depth cues with temporal and motion cues in the video; in addition, by fusing single-frame depth cues with the stereo matching information of adjacent frames in video, a two-stream multi-stage hybrid video self-supervised depth estimation method is proposed.

The main contributions of this thesis are summarized as follows:

1. A two-stage depth completion algorithm based on relative depth estimation and scale recovery is proposed. The algorithm addresses the problem that the distribution pattern of sparse depth may differ across sensor settings at test time. Based on the idea of disentangling geometric structure from absolute scale, the depth completion task is decomposed into two subtasks: relative depth estimation and scale recovery. Exhaustive experiments on public datasets show the effectiveness of the proposed method.

2. A multi-frame self-supervised monocular depth estimation algorithm based on temporal and motion enhancement and adversarial metric learning is proposed. By fusing single-frame depth information with the temporal and motion depth information in video, the algorithm builds a temporally enhanced self-supervised depth estimation framework and introduces an adversarial metric learning training strategy, obtaining more accurate and temporally consistent depth estimation results. Specifically, the framework consists of a spatial-temporal attention-based aggregation module, a motion feature extraction module, and a motion-guided attention-based refinement module, which together improve the accuracy and temporal consistency of depth estimation by aggregating temporal information in videos. The algorithm then introduces an analyzer that automatically finds a discriminative feature space through adversarial training, further enhancing the temporal consistency of the results. Exhaustive experiments on public datasets verify the effectiveness of the proposed method. Finally, to evaluate the proposed temporal enhancement, a temporal consistency metric based on reliable optical flow ground truth is further proposed, which is expected to promote future research in this field.

3. A two-stream multi-stage hybrid framework for multi-frame self-supervised monocular depth estimation is proposed. By fusing single-frame depth information with the stereo matching information of adjacent frames in videos, the algorithm designs a two-stream multi-stage hybrid decoder that effectively combines single-frame scene information with multi-frame stereo matching information, obtaining more accurate depth estimation results. The algorithm uses a plane-sweep cost volume to explicitly represent the matching information. To better fuse the two kinds of information, a multi-stage interactive two-stream framework guided by the depth priors of the scene is designed. Furthermore, distillation learning with a moving-object mask is used for better performance. Finally, exhaustive experiments on public datasets verify the effectiveness of the proposed method; the computational cost and limitations of the method are also analyzed, and future work is discussed.
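The structure/scale disentanglement behind the two-stage depth completion can be illustrated with a minimal sketch. This is an assumed, simplified formulation (not the thesis's exact model): a network predicts a scale-ambiguous relative depth map, and a single global scale is then recovered in closed form by least-squares fitting against whatever sparse measurements the sensor provides. The function name and toy data are hypothetical.

```python
import numpy as np

def recover_scale(relative_depth, sparse_depth):
    """Estimate a global scale s minimizing ||s * d_rel - d_sparse||^2
    over the valid sparse points (closed-form least squares), then apply
    it to the dense relative depth to obtain metric depth."""
    mask = sparse_depth > 0                 # valid sparse measurements only
    d_rel = relative_depth[mask]
    d_sp = sparse_depth[mask]
    s = (d_rel * d_sp).sum() / (d_rel * d_rel).sum()
    return s * relative_depth               # completed metric depth map

# toy example: relative depth is off by a global factor of 2,
# and only two sparse ground-truth points are available
rel = np.array([[1.0, 2.0], [3.0, 4.0]])
sparse = np.array([[2.0, 0.0], [0.0, 8.0]])
metric = recover_scale(rel, sparse)         # recovers s = 2 exactly
```

Because the scale is a single free parameter, this stage is robust to the sparse-depth distribution shifts across sensor settings that the relative-depth network never has to model.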
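One plausible shape for a flow-based temporal consistency metric, as proposed in contribution 2, is sketched below. This is an illustrative assumption, not the thesis's exact definition: ground-truth optical flow establishes pixel correspondences between consecutive frames, the next frame's predicted depth is warped back through them (nearest-neighbour, for brevity), and the mean absolute relative depth difference is reported, so that lower values mean more temporally consistent predictions. Positive depths are assumed.

```python
import numpy as np

def temporal_consistency(depth_t, depth_t1, flow_t_to_t1):
    """Warp depth_t1 into frame t using ground-truth optical flow and
    return the mean absolute relative depth difference (lower is more
    temporally consistent). flow has shape (H, W, 2) as (dx, dy)."""
    h, w = depth_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # pixel (x, y) in frame t corresponds to (x + dx, y + dy) in frame t+1
    xw = np.clip(np.round(xs + flow_t_to_t1[..., 0]).astype(int), 0, w - 1)
    yw = np.clip(np.round(ys + flow_t_to_t1[..., 1]).astype(int), 0, h - 1)
    warped = depth_t1[yw, xw]               # depth_t1 sampled at matches
    return np.mean(np.abs(warped - depth_t) / depth_t)
```

For a static scene with zero flow and identical predictions the metric is exactly zero; flickering predictions raise it even when each frame is individually accurate, which is precisely what a per-frame error metric cannot detect.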
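The plane-sweep cost volume used in contribution 3 can be sketched in a deliberately simplified form. The sketch below assumes fronto-parallel depth hypotheses, known intrinsics K and relative pose (R, t), nearest-neighbour sampling, and a negative-L1 feature similarity; the real method operates on learned features inside the network, so treat every name here as hypothetical.

```python
import numpy as np

def plane_sweep_cost_volume(ref_feat, src_feat, K, R, t, depths):
    """Minimal plane-sweep: for each depth hypothesis, back-project the
    reference pixels to that depth, reproject them into the source view,
    and score feature similarity. Features are (C, H, W); returns a
    (D, H, W) volume whose argmax over D is a per-pixel depth index."""
    C, H, W = ref_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    K_inv = np.linalg.inv(K)
    volume = np.empty((len(depths), H, W))
    for i, d in enumerate(depths):
        cam = K_inv @ pix * d                       # 3D points at depth d
        proj = K @ (R @ cam + t.reshape(3, 1))      # into the source view
        u = np.clip(np.round(proj[0] / proj[2]).astype(int), 0, W - 1)
        v = np.clip(np.round(proj[1] / proj[2]).astype(int), 0, H - 1)
        warped = src_feat[:, v, u].reshape(C, H, W)
        volume[i] = -np.abs(warped - ref_feat).sum(axis=0)  # similarity
    return volume
```

The volume makes multi-frame matching evidence explicit, which is what the two-stream decoder then fuses with the single-frame scene stream; where matching fails (e.g. on moving objects), the single-frame stream and the moving-object mask compensate.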