| Understanding the categories, positions, and poses of objects in indoor scenes is a key prerequisite for human-computer interaction tasks in robotic grasping, virtual reality, and augmented reality. At present, many algorithms assume uncomplicated scenes with little or no occlusion between objects, and can handle only a single target per scene. In real pose estimation scenarios, however, most objects are placed in disorder, and occlusion between objects or self-occlusion is common, so the robustness and generalization of pose estimation algorithms in complex scenes remains a challenging research problem. This paper addresses the issue from the following aspects. Firstly, to handle the frequent occlusion between indoor objects, this paper improves the stacked denoising autoencoder and proposes an image reconstruction method for occluded indoor targets based on an enhanced autoencoder. The algorithm first adds random noise (such as Gaussian noise or random mask occlusion) to the input image; it then feeds the noisy image to the enhanced autoencoder, which, after encoding and decoding, outputs a vector of the same dimension as the original image; finally, the vector is converted to an image of the same size as the original input image. A comparison of images from the LINEMOD dataset before and after reconstruction shows that the enhanced autoencoder is a feasible method for reconstructing occluded targets and can be easily integrated with other networks. Secondly, because many algorithms cannot handle multi-target detection in complex scenes, this paper first extends the original single-target LINEMOD dataset into a multi-target dataset, and then uses an improved Faster R-CNN network to carry out multi-target detection experiments in complex scenes. The improvements to Faster R-CNN include using the deeper ResNet101 as the feature extraction network to strengthen feature extraction, adopting the RoI Align method of Mask R-CNN to improve downsampling accuracy, and reducing the original anchor box size to better suit small target objects. The modified Faster R-CNN performs very well on target detection for the multi-target LINEMOD dataset. However, the large amount of mutual occlusion in the dataset has some impact on the subsequent research. Thirdly, for the pose estimation problem, this paper uses the PnP algorithm to recover the 3D rotation and 3D translation of the object (its 6D pose) from the object's key points. When predicting key points from the image, the method must account for the fact that occlusion of the target can prevent them from being predicted accurately. The region of interest of the target object output by Faster R-CNN is therefore first reconstructed with the enhanced autoencoder described above. The key points of the object are then regressed by adding a fully connected layer after the enhanced autoencoder. Experiments show that the algorithm can accurately regress the key points even when the target is occluded. Finally, the PnP algorithm is used to compute the 3D rotation and 3D translation of the object and thus its pose. Compared with other pose estimation algorithms, the adopted method is more accurate, and it still performs well when the target in the image is occluded. Lastly, although the above method can accurately estimate the pose of the object, it is not end-to-end. This paper therefore draws on the reward-punishment strategy of reinforcement learning and explores an end-to-end indoor object pose estimation method based on reinforcement learning. Faster R-CNN outputs a probability for each object class, and different probabilities affect the pose result output by the enhanced autoencoder, so an expectation can be computed from the pose estimation result and the probabilities output by target detection and then backpropagated, yielding an end-to-end pose estimation algorithm. |
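
The input corruption step of the reconstruction method (adding Gaussian noise and a random mask occlusion before the image is passed to the autoencoder) might look roughly like the sketch below. It is only an illustration under stated assumptions: the image is a grayscale grid stored as nested Python lists with values in [0, 1], and the function name `corrupt_image`, its parameters, and their defaults are hypothetical and not taken from the paper.

```python
import random


def corrupt_image(image, noise_std=0.1, mask_size=2, rng=None):
    """Return a corrupted copy of a grayscale image (list of rows, values in [0, 1]).

    Adds Gaussian noise to every pixel, then zeroes one randomly placed
    square patch to imitate an occluder. Parameter names and defaults are
    illustrative assumptions, not values from the paper.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])

    # Additive Gaussian noise, clipped back into the valid intensity range.
    noisy = [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, noise_std))) for pixel in row]
        for row in image
    ]

    # Random mask occlusion: blank out a square patch at a random position.
    side = min(mask_size, height, width)
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    for r in range(top, top + side):
        for c in range(left, left + side):
            noisy[r][c] = 0.0
    return noisy


# Usage: corrupt a uniform 6x6 test image with a fixed seed.
clean = [[0.5] * 6 for _ in range(6)]
corrupted = corrupt_image(clean, noise_std=0.05, mask_size=2, rng=random.Random(0))
```

In a denoising setup of this kind, the corrupted image would serve as the network input and the clean image as the reconstruction target, so the network learns to fill in the masked region.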