With the development of technology and society, computer vision has been widely applied in many fields. Object classification and detection, two fundamental problems in computer vision research, form the basis of scene recognition, action recognition, face recognition, and other visual tasks. An analysis of existing classification and detection techniques shows that methods based on RGB images lack the depth information of the actual three-dimensional scene because of the inherent limitations of two-dimensional data, and are therefore vulnerable to illumination changes, variations in object scale, and other factors. Methods based on RGB-D images compensate for this shortcoming, but they require a depth sensor to capture real depth information at test time, and the limited measurement range of such sensors further restricts their application. In this paper, depth information obtained by monocular depth estimation is introduced into existing object classification and detection models to improve their recognition performance. The main contributions are as follows:

(1) An object classification and detection method based on monocular depth estimation is proposed. The model needs only an RGB image at test time and can exploit depth information without a depth sensor, thereby improving the recognition performance of existing classification and detection algorithms.

(2) To overcome the weakness of existing monocular depth estimation algorithms in reconstructing fine details, a depth estimation model based on a feature pyramid network is proposed. A pixel shuffle module strengthens feature extraction during upsampling, and a residual pooling module allows the network to make full use of context information. A loss function combining three geometric terms, namely depth, gradient, and surface normals, is also implemented. Experiments on public datasets demonstrate that the proposed method obtains good depth estimation results with fewer parameters and higher running speed, and reconstructs the structural details of the scene more faithfully, which benefits the subsequent classification and detection tasks.

(3) The depth estimation model is integrated into existing object classification and detection models. Fusion strategies such as the network initialization method and the choice of fusion location are explored. The depth estimation model and the recognition model are optimized jointly through multi-task learning, so that the depth estimation model receives semantic guidance from the recognition tasks during training and generates sharp depth maps that benefit classification and detection. Comparative experiments on public datasets show that the proposed method effectively improves the performance of classification and detection models. The method is also evaluated on public datasets that contain no ground-truth depth maps, and the experiments show that it generalizes well to natural scenes.
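To make the pixel-shuffle upsampling mentioned in contribution (2) concrete, the following is a minimal PyTorch sketch of one decoder upsampling block built around nn.PixelShuffle. The layer layout, channel counts, and the class name PixelShuffleUp are assumptions for illustration; the thesis's actual decoder may differ.

```python
import torch.nn as nn

class PixelShuffleUp(nn.Module):
    """Illustrative upsampling block: a convolution expands the channel
    dimension so that pixel shuffle can trade channels for spatial
    resolution, avoiding the checkerboard artifacts of naive deconvolution.
    (Hypothetical block, not the thesis's exact architecture.)"""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        # produce out_ch * scale^2 channels so PixelShuffle can rearrange
        # them into an out_ch feature map upsampled by `scale`
        self.conv = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))
```

In a feature-pyramid decoder, blocks like this would be stacked so that coarse, high-level features are progressively upsampled and merged with higher-resolution features from the encoder.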
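The three-term loss in contribution (2) can likewise be sketched. The snippet below follows the common formulation for monocular depth estimation (an L1 depth term, an L1 term on finite-difference depth gradients, and a cosine term on surface normals derived from those gradients); the exact weighting and formulation in the thesis may differ, and depth_grad_normal_loss is a name chosen here for illustration.

```python
import torch
import torch.nn.functional as F

def _grad_x(d):  # horizontal finite difference of a (B, 1, H, W) depth map
    return d[:, :, :, 1:] - d[:, :, :, :-1]

def _grad_y(d):  # vertical finite difference
    return d[:, :, 1:, :] - d[:, :, :-1, :]

def depth_grad_normal_loss(pred, gt):
    """Sketch of a depth + gradient + surface-normal loss (assumed equal weights)."""
    # (1) per-pixel depth term
    l_depth = torch.mean(torch.abs(pred - gt))

    # (2) gradient term: match horizontal and vertical depth gradients
    l_grad = (torch.mean(torch.abs(_grad_x(pred) - _grad_x(gt))) +
              torch.mean(torch.abs(_grad_y(pred) - _grad_y(gt))))

    # (3) normal term: normals n = (-dx, -dy, 1) from depth gradients,
    # compared with cosine similarity; gradients cropped to a common size
    pdx, pdy = _grad_x(pred)[:, :, 1:, :], _grad_y(pred)[:, :, :, 1:]
    gdx, gdy = _grad_x(gt)[:, :, 1:, :],   _grad_y(gt)[:, :, :, 1:]
    ones = torch.ones_like(pdx)
    pred_n = torch.cat([-pdx, -pdy, ones], dim=1)
    gt_n = torch.cat([-gdx, -gdy, ones], dim=1)
    l_normal = torch.mean(1 - F.cosine_similarity(pred_n, gt_n, dim=1))

    return l_depth + l_grad + l_normal
```

The gradient term encourages sharp depth discontinuities at object boundaries, while the normal term penalizes errors in local surface orientation, which is what improves the reconstruction of structural details.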
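Finally, the multi-task joint optimization in contribution (3) amounts to backpropagating a weighted sum of the recognition loss and the depth loss through a shared backbone. The toy sketch below is purely illustrative: the tiny network, the classification task standing in for detection, and the 0.5 weight are all assumptions, not the thesis's configuration; it reuses the depth_grad_normal_loss sketch above.

```python
import torch
import torch.nn as nn

# Toy shared backbone with a recognition head and a depth head (assumed sizes)
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
depth_head = nn.Conv2d(16, 1, 3, padding=1)

params = list(backbone.parameters()) + list(cls_head.parameters()) + list(depth_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
ce = nn.CrossEntropyLoss()

# Dummy batch standing in for a real dataset
images = torch.rand(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))
gt_depth = torch.rand(4, 1, 64, 64)

# One joint optimization step: recognition loss + weighted depth loss
features = backbone(images)
loss = ce(cls_head(features), labels) \
     + 0.5 * depth_grad_normal_loss(depth_head(features), gt_depth)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because both heads share the backbone, gradients from the recognition loss also shape the features used for depth prediction, which is the mechanism by which the depth branch receives semantic guidance during training.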