| Detecting the 3D pose of objects is a fundamental problem in computer vision for 3D perception, and it plays a crucial role in areas such as augmented reality and robot control. With the widespread use of intelligent mobile devices, 3D object pose detection based on monocular color cameras has become a research hotspot. The goal is to estimate the pose of each object in an RGB image relative to the camera. Currently, the majority of methods decouple the pose detection process into two modules: object detection and pose estimation. These methods use deep convolutional neural networks to first detect the class and bounding box of each object in the RGB input image. The networks then predict the correspondences between 2D points within the bounding box and 3D points on the object model. Finally, the pose is estimated from the 2D-3D point correspondences. Although the performance of 3D object pose estimation has improved steadily in recent years, fast and accurate pose detection remains a challenging task with many unresolved issues. First, deep learning methods depend on a large amount of manually annotated image data, leading to high annotation costs. Second, the objects themselves present challenges, such as surfaces lacking distinct texture features and symmetric structures. Third, there are environmental factors to consider, including cluttered backgrounds, partial occlusion, and similar colors. All of these issues can easily lead to pose detection failures. This paper investigates fast and accurate detection of 3D object poses in complex scenes from RGB images. To tackle this problem, a large number of synthetic images are generated from 3D object models to provide cost-effective training data for the network. Moreover, the 3D object pose detection problem is decoupled into two
modules: object detection and pose estimation. This paper employs transfer learning to enhance the generalization ability of the object detection module in cross-domain environments. To reduce the number of model parameters, a lightweight encoder-decoder network, MLP-ResUnet (Multilayer Perceptron Residual U-Net), is proposed. To enhance the robustness of the algorithm for textureless objects, dense point correspondences are predicted. Furthermore, a mask constraint is incorporated into the PnP-RANSAC (Perspective-n-Point RANSAC) framework, resulting in the Mask-constrained PnP-RANSAC (MaskSAC) approach. This method improves the accuracy of pose estimation when the inlier ratio is low, especially when the background is cluttered and objects are heavily occluded. The specifics of this research are detailed below: (1) To address the disparity between synthetic and real images, this study proposes methods that mitigate its impact on the training of neural networks: using random backgrounds, adding random noise, applying blur, and employing random perspective sampling. Building on a two-stage object detection technique and drawing inspiration from transfer learning research, this paper introduces a two-stage fine-tuning approach to train the detector. Experimental results demonstrate that, compared to training the object detector directly on synthetic images, the proposed approach improves the detection accuracy by 47.2%. (2) Introducing contrastive learning into the point-based pose estimation method results in a large number of network parameters, which makes the model difficult to deploy on mobile devices. To address this, this paper proposes a lightweight encoder-decoder network, MLP-ResUnet, which reduces the network parameter count by 55% by using depthwise separable convolutions in place of conventional convolutions. (3) When reducing the size of deep convolutional neural
networks, it is inevitable that the quality of the 2D-3D point correspondences predicted by the network decreases. The proportion of inliers among the correspondences may then be quite low, and although PnP-RANSAC is robust to outliers, its accuracy is still limited by the quality of the correspondences. To address this issue, MaskSAC introduces mask constraints to improve the accuracy of pose estimation when the proportion of inliers is low, and reduces time complexity by precomputing search lines. Experimental results demonstrate an 84.8% reduction in the average pose estimation time compared to the original method, together with an improvement in accuracy. |
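
The parameter saving from replacing conventional convolutions with depthwise separable ones can be illustrated with a short count. This is a minimal sketch only: the layer sizes (64 input channels, 128 output channels, 3 x 3 kernel) and the function names are hypothetical and are not taken from MLP-ResUnet, and the per-layer figure it yields is not comparable to the 55% network-wide reduction reported above, which depends on the whole architecture.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Weight count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k, bias=False):
    """Weight count of a depthwise k x k convolution (one filter per
    input channel) followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer, chosen only for illustration.
standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
reduction = 1 - separable / standard

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"per-layer reduction: {reduction:.1%}")
```

Without bias terms, the ratio of separable to standard weights is 1/c_out + 1/k^2, so the saving grows with the kernel size and the number of output channels.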
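
Why a low inlier ratio is costly for PnP-RANSAC can be seen from the standard RANSAC iteration bound, N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the number of correspondences per minimal sample, and p the desired probability of drawing at least one all-inlier sample. The sketch below evaluates this bound only. The values s = 4 and p = 0.99 are assumptions for illustration, the function name is hypothetical, and the mask constraint and precomputed search lines of MaskSAC are not modeled.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=4, success_prob=0.99):
    """Number of random samples needed so that, with probability
    success_prob, at least one sample contains only inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < success_prob < 1.0:
        raise ValueError("success_prob must be in (0, 1)")
    # Probability that one minimal sample is made up of inliers only.
    p_good = inlier_ratio ** sample_size
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - p_good))


# The required number of samples grows quickly as the inlier ratio drops.
for w in (0.9, 0.5, 0.3, 0.1):
    print(f"inlier ratio {w:.1f}: {ransac_iterations(w)} iterations")
```

Because the bound grows roughly as 1 / w^s, lower-quality correspondences from a smaller network translate into many more hypotheses to evaluate, which is the regime the text targets.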