Scene depth estimation is a fundamental topic in computer vision. As a low-level task in three-dimensional vision, it supplies the basic depth information needed for virtual reality, robot obstacle avoidance, robot-assisted surgery, and related research, and depth information also plays an important role in 3D reconstruction, autonomous driving, and scene perception. Methods for obtaining depth information have therefore become a research focus. Traditional depth estimation based on binocular or multi-view cameras is affected by complex environments and incurs high computational cost. In contrast, monocular depth estimation has clear advantages: it is easy to deploy and computationally cheap. A monocular method must regress 3D depth from a single 2D image that lacks reliable stereo relationships, yet it places low demands on the environment and equipment and is closer to the needs of practical applications. For these reasons, monocular depth estimation based on convolutional neural networks has gradually gained attention. As research deepened, it was found that, compared with convolutional neural networks, graph convolutional neural networks can process non-Euclidean data and capture more correlation information in the scene, so monocular scene depth estimation based on graph convolutional networks has become a research focus. However, problems remain in monocular scene depth estimation: existing indoor methods do not consider the joint influence of spatial context information and semantic information on depth estimation, which tends to blur object boundaries in the predicted depth map and leads to poor structural consistency, while some outdoor methods do not consider the complexity of real scenes, which leads to inaccurate prediction and loss of local detail.

(1) To address the problem that existing indoor scene depth estimation methods do not consider the influence of spatial context information and semantic information on depth estimation simultaneously, which tends to blur object boundaries in the depth map and results in poor structural consistency, this thesis proposes a graph convolutional network model that combines depth information with semantic information. First, the initial depth map module obtains global association information between features through multi-scale feature fusion and an attention mechanism: multi-scale fusion integrates geometric features at different scales, while the attention mechanism learns feature similarity between pixels to capture more global scene context. The semantic information obtained in the semantic guidance module encourages the network to better capture the geometry of the scene, yielding more context and location information and producing a semantic result feature map. The initial depth feature map and the semantic result feature map are then concatenated to obtain a feature map with richer information. Information useful to both tasks is shared by computing the similarity between corresponding nodes, under the constraint that nodes with similar depth should have consistent semantic regions. An adjacency matrix is constructed from these similarities to build the graph structure, and graph convolution extracts the spatial context of the geometric features more accurately, realizing monocular depth inference and producing a high-precision scene depth map.

(2) Existing self-supervised depth estimation methods do not consider the complexity of realistic scenes, which leads to inaccurate prediction of details and loss
of local details. To address this, this thesis proposes a self-supervised method built from a depth estimation network and a pose estimation network. The depth estimation network uses a geometric deep learning network to help extract object-based location features and, by generating a multi-scale depth topology, maintains the relationships between nodes in the depth map; the depth map itself is produced by a multi-scale GCN. The pose estimation network is a regression network with encoder and decoder parts: the pose encoder receives a concatenated pair of images, the source image and the target image, and the network outputs the relative pose between the two. Together, the two networks supply the geometric information that establishes point-to-point correspondences in the reconstructed image. The estimated relative pose and depth map are both used to reconstruct the target image, with loss functions based on photometric reprojection and smoothness penalizing poor depth predictions. After training and testing on the KITTI dataset, the results show that the depth maps obtained by the proposed method are more accurate in the local details of objects and help recover more detail.
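The graph construction described in contribution (1) — computing similarities between nodes of the fused feature map, building an adjacency matrix, and applying graph convolution — can be sketched as follows. This is a minimal numpy illustration, not the thesis's actual implementation: the Gaussian-kernel similarity, the single GCN layer, and all dimensions are assumptions.

```python
import numpy as np

def build_adjacency(features, sigma=1.0):
    """Dense adjacency matrix from pairwise node-feature similarity
    (Gaussian kernel over squared Euclidean distances)."""
    diff = features[:, None, :] - features[None, :, :]
    dist2 = (diff ** 2).sum(-1)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def gcn_layer(adjacency, node_feats, weight):
    """One graph-convolution layer with symmetric normalization:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    n = adjacency.shape[0]
    a_hat = adjacency + np.eye(n)                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ node_feats @ weight, 0.0)  # ReLU

# Toy example: 6 nodes whose features stand in for the concatenated
# depth + semantic channels of the fused feature map.
rng = np.random.default_rng(0)
fused = rng.normal(size=(6, 8))                   # 6 nodes, 8-dim features
A = build_adjacency(fused)
W = rng.normal(size=(8, 4))                       # project to 4 channels
out = gcn_layer(A, fused, W)
print(out.shape)                                  # (6, 4)
```

The symmetric normalization keeps the propagated features at a comparable scale regardless of how many neighbors a node has, which matters when similarity-derived graphs are dense.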
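The two loss terms named in contribution (2) can be sketched in numpy under simplifying assumptions: an L1 photometric reprojection error (full self-supervised pipelines typically add an SSIM term) and an edge-aware first-order smoothness penalty on the predicted disparity. The weighting factor is illustrative, not the thesis's value.

```python
import numpy as np

def photometric_l1(target, reconstructed):
    """Mean absolute photometric error between the target image and
    the image reconstructed from the estimated depth and pose."""
    return np.abs(target - reconstructed).mean()

def edge_aware_smoothness(disparity, image):
    """Penalize disparity gradients, down-weighted where the image
    itself has strong gradients (likely object boundaries)."""
    dx_d = np.abs(np.diff(disparity, axis=1))
    dy_d = np.abs(np.diff(disparity, axis=0))
    # Image gradients (averaged over color channels) gate the penalty.
    dx_i = np.abs(np.diff(image, axis=1)).mean(axis=-1)
    dy_i = np.abs(np.diff(image, axis=0)).mean(axis=-1)
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

# Toy tensors standing in for a target image, a view reconstructed
# via the estimated pose/depth, and the predicted disparity map.
rng = np.random.default_rng(0)
img = rng.random((16, 16, 3))
rec = rng.random((16, 16, 3))
disp = rng.random((16, 16))
loss = photometric_l1(img, rec) + 0.001 * edge_aware_smoothness(disp, img)
print(loss >= 0.0)                                # True
```

The edge-aware gating is what lets the smoothness term suppress noise in textureless regions without flattening depth discontinuities at object boundaries — the local-detail failure mode this contribution targets.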