Depth is an important form of sensor data that expresses the distance from each pixel of the scene to the sensor. It provides three-dimensional information for many computer vision tasks, such as object detection, pose estimation, and semantic segmentation, which makes depth estimation an important topic in computer vision. Previous depth estimation algorithms are mainly based on binocular, multi-camera, structured-light, and laser solutions, and monocular depth estimation has received less attention. In recent years, deep-learning-based monocular depth estimation models have gradually approached binocular solutions in reliability and accuracy while offering a clear cost advantage. However, because monocular depth estimation is an ill-posed problem, such models are often accompanied by a huge number of parameters and a large amount of computation, which makes them difficult to deploy on embedded devices. To improve the inference speed of monocular depth estimation models, some researchers focus on lightweight network design and use minimalist network structures to achieve real-time or faster-than-real-time inference. However, these lightweight methods usually perform poorly, since the accuracy of a monocular depth estimation model typically scales with its computation and parameter count. This results in a large performance gap between lightweight and heavyweight models, limiting their application in the real world.

This paper models the main accuracy gap between them as a difference in depth distribution, which we call the "distribution drift" of depth. The depth distribution counts the number of pixels whose depth values fall into each value range, and thus characterizes how depth varies across the scene. We decompose the distribution drift problem into a deviation in depth distribution shape and a deviation in scene depth range, and propose the depth distribution alignment network (DANet). We first design a pyramid scene transformer (PST) module to capture inter-region interactions at multiple scales. By perceiving the differences in depth features between every pair of regions, DANet tends to predict a reasonable scene structure, fitting the shape of the predicted distribution to that of the ground truth. We then propose a local-global optimization (LGO) scheme to supervise the global range of scene depth. Thanks to the alignment of both the depth distribution shape and the scene depth range, DANet effectively alleviates distribution drift and achieves performance comparable to prior heavyweight methods while using only about 1% of their floating-point operations (FLOPs).

Finally, we conduct qualitative and quantitative experiments on the widely used NYUD v2 dataset, evaluate the generalization of our method on the iBims-1 dataset, and further improve the performance and generalization of the proposed model through joint training on multiple datasets.
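To make the notion of a depth distribution concrete, the following is a minimal sketch of how such a distribution could be computed as a per-scene histogram. The bin count, maximum depth, and function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def depth_distribution(depth_map: np.ndarray, num_bins: int = 64,
                       max_depth: float = 10.0) -> np.ndarray:
    """Count the pixels whose depth falls into each value range.

    depth_map: H x W array of per-pixel depths in meters.
    Returns a normalized histogram that describes the scene's depth
    distribution.  num_bins and max_depth are assumed values here.
    """
    hist, _ = np.histogram(depth_map, bins=num_bins, range=(0.0, max_depth))
    # Normalize so distribution shapes are comparable across scenes.
    return hist / max(hist.sum(), 1)
```

Comparing such histograms for a lightweight model's prediction and the ground truth is one way to visualize the "distribution drift" described above.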
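Similarly, the shape/range decomposition can be illustrated by separating a depth map's min-max range from a range-normalized histogram. This is a hypothetical sketch of the decomposition idea, not DANet's actual training objective; the helper name and epsilon are assumptions.

```python
import numpy as np

def decompose_distribution(depth_map: np.ndarray, num_bins: int = 64):
    """Split a depth map into (distribution shape, scene depth range).

    The range component is the scene's min/max depth; the shape component
    is the histogram of depths after min-max normalization, so it is
    independent of the absolute depth range.
    """
    d_min, d_max = float(depth_map.min()), float(depth_map.max())
    normalized = (depth_map - d_min) / (d_max - d_min + 1e-8)
    shape_hist, _ = np.histogram(normalized, bins=num_bins, range=(0.0, 1.0))
    shape = shape_hist / max(shape_hist.sum(), 1)
    return shape, (d_min, d_max)
```

Under this hypothetical decomposition, aligning the shape component corresponds to predicting a plausible scene structure, while aligning the range component corresponds to supervising the global extent of scene depth.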