The monocular depth estimation task aims to estimate depth information from a single image of a general scene. This task is a typical ill-posed problem, because a captured image may correspond to many possible natural scenes. Mainstream depth estimation algorithms can be divided into supervised and unsupervised methods. Unsupervised depth estimation has become a current research hotspot because, compared with supervised methods, it avoids laborious annotation and additional calibration. However, unsupervised methods share several common problems, such as semantic distortion, unclear boundaries, and projection distortion caused by moving objects. This paper improves the performance of unsupervised depth estimation by fusing different cues. The main innovations are as follows:

(1) An unsupervised depth estimation model based on semantic segmentation is proposed. This model improves the classic unsupervised depth estimation framework while addressing the above problems. It uses a pre-trained semantic segmentation model as prior information that guides the generation of depth features. This enriches the semantic knowledge of the depth network and effectively alleviates the semantic distortion and unclear boundaries in the depth estimation task. In addition, the motion of moving objects is constrained by semantic labels, a setting designed to suppress the projection distortion that moving objects cause.

(2) An unsupervised depth estimation model based on spatial context is proposed. The model above relies heavily on the pre-trained segmentation model; in contrast, this model adaptively captures global and local context information pixel by pixel. Specifically, an attention map measuring the demand for global context is built from the similarity between global and local features, and its inverse measures the demand for local context. In addition, informative features are selectively emphasized by recalibrating the feature map along the channel dimension. Each pixel is then filtered according to the features of its surrounding points through a co-occurrence network layer, enhancing its edge details. Coarse-to-fine results are obtained by placing multiple adaptive context modules at different network layers.

To evaluate the effectiveness of the proposed models, we conduct extensive experiments on commonly used datasets. The experimental results show that both models outperform mainstream methods.
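The moving-object constraint in model (1) builds on the view-synthesis objective standard in unsupervised depth estimation: a source frame is warped into the target view and a photometric reconstruction error is penalized. The sketch below is illustrative only, not the exact formulation: it reduces the photometric term to per-pixel L1 (omitting the usual SSIM component) and assumes that pixels of movable semantic classes are simply down-weighted; the function name `masked_photometric_loss` is ours.

```python
import numpy as np

def masked_photometric_loss(target, warped, movable_mask):
    """Simplified photometric reconstruction loss for unsupervised depth
    estimation. `target` and `warped` are (H, W, 3) images; `movable_mask`
    is an (H, W) boolean map of pixels whose semantic label is a movable
    class (car, pedestrian, ...). Such pixels violate the static-scene
    assumption behind view synthesis, so they are excluded here."""
    l1 = np.abs(target - warped).mean(axis=-1)   # per-pixel L1 error, (H, W)
    weights = np.where(movable_mask, 0.0, 1.0)   # zero weight on moving objects
    return (l1 * weights).sum() / max(weights.sum(), 1.0)
```

In a full system the SSIM term and a learned or label-driven motion model for the masked regions would be added on top, but the principle of letting semantic labels gate the reprojection loss is the same.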
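The per-pixel global/local weighting and channel recalibration in model (2) can be illustrated with a minimal NumPy sketch. All learned components are replaced by fixed operations: cosine similarity stands in for the learned attention projection, and the squeeze-and-excitation-style gate omits its fully connected layers. The function name `adaptive_context` and these simplifications are our assumptions, not the exact module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_context(features):
    """Sketch of pixel-wise adaptive context aggregation.
    `features` is an (H, W, C) feature map."""
    g = features.mean(axis=(0, 1))                            # global descriptor, (C,)
    # similarity between each local feature and the global descriptor
    num = features @ g                                        # (H, W)
    den = np.linalg.norm(features, axis=-1) * np.linalg.norm(g) + 1e-8
    global_w = sigmoid(num / den)    # attention map: demand for global context
    local_w = 1.0 - global_w         # its inverse: demand for local context
    # blend the global descriptor and the raw local feature per pixel
    out = global_w[..., None] * g + local_w[..., None] * features
    # channel recalibration: emphasize informative channels (no learned FC here)
    gate = sigmoid(out.mean(axis=(0, 1)))                     # (C,)
    return out * gate, global_w
```

Stacking such a module at several decoder resolutions gives the coarse-to-fine behavior described above; the co-occurrence filtering over neighboring pixels would be a separate local operator applied to `out`.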