Estimating depth from images is a critical computer vision task with many applications in downstream tasks such as scene understanding for autonomous driving, virtual reality, and robot navigation. In contrast to binocular depth estimation, monocular depth estimation offers a cost-effective alternative because it requires no specialized equipment or instruments. Traditional monocular depth estimation algorithms rely on cues such as shadows and textures, which are susceptible to environmental disturbances and generalize poorly. Deep learning-based monocular depth estimation algorithms, by contrast, recover scene depth from cues such as the 3D geometric structure of objects in the image, impose fewer environmental constraints, and offer greater research and application value. However, current deep learning-based monocular depth estimation algorithms still face challenges such as blurred object boundaries, loss of thin structures, and low feature utilization. To address these issues, this study investigates monocular depth estimation using deep learning methods. The main contributions are as follows.

(1) This study introduces a novel monocular depth estimation algorithm that combines a spatial pyramid with a hybrid attention mechanism. The proposed algorithm comprises three main modules: a feature extraction module, a spatial pyramid module, and a hybrid attention module. The backbone network is based on ResNeXt, whose "split-transform-merge" design improves model accuracy without increasing complexity. So that shallow features carry information at multiple scales, a spatial pyramid module is inserted before shallow and deep features are fused in the skip connections. The decoder is built from hybrid attention modules, each combining a spatial attention mechanism and a channel attention mechanism; the attention weights are adjusted through dynamic learning, enhancing target object regions and suppressing irrelevant background regions. Our experiments demonstrate that the proposed algorithm improves monocular depth estimation accuracy.

(2) This paper proposes a novel monocular video depth estimation algorithm based on the Swin Transformer architecture. Although convolutional neural networks can perform local inference using information in a local neighborhood, their limited receptive field may be insufficient to estimate the depth of a particular object accurately. Using the Swin Transformer as the backbone frees the network from purely local inference and supports global modeling. In addition, a pose estimation network is incorporated; it takes two consecutive frames as input and produces the translation-rotation transformation matrix of the capturing view between the two frames. Building on the work of Lyu et al., we introduce multiscale connections into the decoder of our network to recover fine-grained details that may be lost during downsampling. By leveraging the features extracted at each hierarchical level of the network, our model achieves superior performance compared to other state-of-the-art algorithms on the challenging task of monocular depth estimation.
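The hybrid attention module in contribution (1), which gates features first per channel and then per spatial position, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names and the random bottleneck weights (standing in for learned parameters) are assumptions.

```python
import numpy as np

def channel_attention(feat, reduction=4):
    # feat: (C, H, W). Global average pooling squeezes the spatial dims,
    # then a two-layer bottleneck produces a sigmoid gate per channel.
    c = feat.shape[0]
    pooled = feat.mean(axis=(1, 2))                  # (C,)
    rng = np.random.default_rng(0)                   # stands in for learned weights
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    hidden = np.maximum(w1 @ pooled, 0)              # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # sigmoid, (C,)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    # Pool across channels, then gate each spatial position.
    avg = feat.mean(axis=0)                          # (H, W)
    mx = feat.max(axis=0)                            # (H, W)
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))         # sigmoid map, (H, W)
    return feat * gate[None, :, :]

def hybrid_attention(feat):
    # Channel attention first, then spatial attention.
    return spatial_attention(channel_attention(feat))
```

Because both gates lie strictly inside (0, 1), the block can only reweight responses: regions the network deems relevant are attenuated less than background regions, which is the enhancement/suppression behavior described above.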
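The pose estimation network in contribution (2) outputs a translation-rotation transform between two consecutive frames. A common parameterization for such networks is a 6-DoF vector (three translation components plus an axis-angle rotation), converted to a 4x4 homogeneous matrix via the Rodrigues formula. The sketch below shows that conversion; the 6-DoF layout is an assumption, not confirmed by the abstract.

```python
import numpy as np

def pose_vec_to_mat(vec):
    """Convert a 6-DoF pose vector (tx, ty, tz, rx, ry, rz) into a 4x4
    homogeneous transform. (rx, ry, rz) is an axis-angle rotation."""
    t = np.asarray(vec[:3], dtype=float)
    r = np.asarray(vec[3:], dtype=float)
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        R = np.eye(3)                      # no rotation
    else:
        k = r / theta                      # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],  # cross-product matrix of k
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        # Rodrigues formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

Applying this matrix warps pixels from one frame into the view of the other, which is what enables the photometric self-supervision typically used to train such depth-and-pose networks.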