Research On Imitation Learning Of Robot Manipulation Tasks Based On Video Semantic Information

Posted on: 2023-06-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C C Yin
Full Text: PDF
GTID: 1528307025962559
Subject: Light industry machinery and packaging engineering
Abstract/Summary:
Robots are playing an increasingly important role in human production and daily life. A robot that can quickly learn to manipulate objects in an unstructured, dynamic environment lowers the barrier to deploying robots and widens the range of fields in which they can be used. This matters for accelerating the digital transformation of traditional manufacturing and for making everyday life more intelligent. Humans can easily learn a new manipulation task by observing others' demonstrations. For robots, however, extracting useful information from human demonstrations and using it to reproduce the demonstrated task remains a major challenge. Scene perception and task understanding based on visual information is one of the main ways for robots to learn from demonstrations, and it is the key step in robotic imitation learning based on video semantic information.

Research on imitation learning of robot manipulation tasks based on video semantic information is still at an exploratory stage, and the following challenges must be addressed before it can be applied in practice:

(1) Accurate detection of object affordances. Detecting object affordances gives the robot fine-grained manipulation information, so that it can understand the functional properties of objects as humans do.

(2) Detection and identification of fine-grained manipulation tasks in demonstration videos. Learning manipulation tasks directly from demonstration videos that contain a large amount of redundant information, without the assistance of other devices, places high demands on the robot's scene understanding ability.

(3) Fusion of multimodal demonstration information to generate robot commands in an end-to-end manner. How to use speech or text to guide the robot toward task-related visual information in demonstration videos within an end-to-end model remains to be studied further.

In response to these problems, this dissertation studies imitation learning of robot manipulation tasks based on the semantic information of demonstration videos. The main research contents and conclusions are as follows:

(1) For object affordance segmentation in robot manipulation tasks, a semantic edge aware network (SEANet) is proposed to produce affordance segmentation results with high edge quality. Semantic segmentation and semantic edge detection are treated as a pair of dual tasks that correspond to the low-frequency and high-frequency parts of the image respectively, and a spatial gradient fusion module is designed to couple the two tasks. In addition, a shared gradient attention module is designed to guide the model to focus on the gradient feature that the two tasks have in common, namely affordance edges. Finally, two duality loss functions are proposed to impose an edge consistency constraint on the model during training, further exploiting the duality between semantic segmentation and semantic edge detection. Experiments show that the spatial gradient fusion module and the shared gradient attention module interact during training to enhance the edge consistency between the two tasks. Jointly optimizing the two tasks improves the edge quality of the affordance segmentation results and provides accurate position information for the robot to manipulate objects.

(2) For multi-object affordance segmentation, a boundary-preserving network (BPN) is proposed to identify and locate multiple objects simultaneously and to perform affordance segmentation for each object. A new IoU branch is added to BPN, and an object bounding box screening strategy based on intersection over union (IoU) is proposed to obtain bounding boxes that cover as much of the whole object as possible. Like SEANet, BPN has a semantic edge supervision branch that guides the model to focus on the edge regions of objects. To enhance the feature representation capability of the model, a relationship attention module is designed to model the potential associations between object categories and affordance categories. Experiments show that, compared with the baseline model Mask RCNN-D, the bounding boxes output by BPN retain more complete object edge information. BPN pays more attention to the edges of object affordances and outputs affordance segmentation results with higher edge quality. BPN provides the robot with the categories, locations and affordances of multiple objects at the same time, which helps the robot manipulate objects in a cluttered environment.

(3) To generate robot manipulation commands from unconstrained demonstration videos, a command generation model (CGM) based on multimodal information is proposed. CGM consists of five components: a text encoder, a video encoder, an action classifier, a keyframe alignment module and a command decoder. The text encoder extracts text features from the video captions. The video encoder extracts visual features from the demonstration videos. The action classifier outputs the action categories in the videos. The keyframe alignment module aligns the text features of the partially decoded command with the relevant visual features in the image sequence to extract keyframe features. The command decoder generates robot manipulation commands from these extracted features. In addition, a mask training method is proposed to train the CGM. Experiments show that, with the help of these components, the proposed CGM extracts the global spatiotemporal features and local keyframe information of the demonstration video and fuses them with the caption information to generate robot manipulation commands.

(4) An experimental platform for imitation learning of robot manipulation tasks is built on the Sawyer robot. Taking "pour water from the cup into the bowl" as an example, an experimental study of imitation learning of robot manipulation tasks based on the semantic information of demonstration videos is carried out. Experiments show that, under the experimental framework proposed in this dissertation, combining the proposed BPN and CMM-M with existing methods such as GraspNet and DMP enables the Sawyer robot to imitate and reproduce the task "pour water from the cup into the bowl".
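
The bounding box screening described in contribution (2) relies on the intersection over union (IoU) measure. The abstract does not give the details of the screening strategy, so the following is only a minimal sketch of the standard IoU computation for axis-aligned boxes. The function names `iou` and `best_box`, the `(x1, y1, x2, y2)` box format, and the idea of ranking candidates against a reference box are assumptions made for illustration, not the method used in BPN.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2
    (an assumed format, not taken from the dissertation).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero
    # when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes give an IoU of 0 by convention here.
    return inter / union if union > 0 else 0.0


def best_box(candidates, reference):
    """Hypothetical helper: return the candidate box whose IoU with
    a reference box is highest."""
    return max(candidates, key=lambda box: iou(box, reference))


if __name__ == "__main__":
    reference = (0, 0, 4, 4)
    candidates = [(0, 0, 2, 2), (1, 1, 4, 4), (3, 3, 6, 6)]
    for box in candidates:
        print(box, round(iou(box, reference), 4))
    print("selected:", best_box(candidates, reference))
```

In this sketch a candidate that covers more of the reference region scores higher, which mirrors the stated goal of keeping boxes that cover the whole object as much as possible.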
Keywords/Search Tags: Learning from demonstration, Affordance detection, Video to command, Robot manipulation task, Deep learning