
Robot Manipulation Command Generation Based On Video Stream

Posted on: 2022-09-03 | Degree: Master | Type: Thesis
Country: China | Candidate: X Y Mo | Full Text: PDF
GTID: 2518306539963139 | Subject: Software engineering
Abstract/Summary:
Having intelligent systems understand human movements from demonstration videos and learn new skills is a new trend in the design of self-learning robots. Humans can understand actions and imitate them directly by observing the behavior of others. However, enabling robots to perform actions based on observations of human activities remains a major challenge in robotics. As intelligent robots grow in popularity, there will be a growing demand for robotic systems that can understand human demonstrations and perform a variety of tasks. Therefore, in exploring how to raise the intelligence level of robot systems, building a framework that learns robot manipulation instructions from videos is of great significance.

In this paper, we propose a sequence-to-sequence robot instruction generation framework. The framework automatically generates instructions that can be used directly in robot applications by observing human demonstration videos without special markers. With this framework, a robot can understand human behavioral intent through visual information and thereby complete high-precision, complex tasks in daily life while avoiding tedious manual programming. More specifically, the framework consists of three steps. In the first step, we use Mask R-CNN to generate the manipulation area; the Two-Stream Inflated 3D ConvNet (I3D) is used to extract optical-flow features and RGB image features from the video, and these two feature streams are then fused. In the second step, a bidirectional LSTM is introduced to obtain contextual cues from the fused visual features. In the last step, two attention mechanisms, a self-attention mechanism and a global attention mechanism, are used to encode the visual and semantic features so as to learn the correlation between the video frame sequence and the instruction sequence. The sequence-to-sequence model finally outputs quaternions as instructions for the robot.

To verify how well the proposed framework lets robots learn manipulation knowledge from videos, extensive experiments are conducted on the publicly available Video-to-Command Dataset and the extended MPII Cooking Activities Dataset 2.0. The experimental results show that the method can effectively learn robot manipulation commands from human demonstration videos and achieves state-of-the-art performance. In addition, the output of the model is applied to a Baxter robot in a real environment, where the robot performs the manipulation tasks corresponding to the instructions learned by the model.
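To make the three-step pipeline concrete, the sketch below shows one possible PyTorch arrangement of the components named in the abstract: fused I3D visual features, a bidirectional LSTM encoder, self-attention over the visual sequence, and a decoder with global (cross-) attention that emits the command token sequence. This is a minimal illustrative sketch, not the thesis implementation; all module choices, dimensions, and the vocabulary size are assumptions, and the Mask R-CNN / I3D feature extraction is assumed to happen upstream.

```python
# Minimal sketch of the video-to-command sequence-to-sequence model.
# Assumptions: fused RGB + optical-flow clip features (e.g. from I3D) are
# precomputed; dimensions and vocabulary size are illustrative only.
import torch
import torch.nn as nn

class VideoToCommand(nn.Module):
    def __init__(self, visual_dim=1024, hidden_dim=512, vocab_size=300, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(visual_dim, hidden_dim)
        # Step 2: bidirectional LSTM gathers temporal context over clip features.
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        # Step 3: self-attention encodes the visual sequence; the decoder uses
        # global (cross-) attention over it while generating command tokens.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, visual_feats, command_tokens):
        # visual_feats: (B, T, visual_dim) fused I3D features per clip
        # command_tokens: (B, L) token ids used for teacher forcing
        x = self.proj(visual_feats)
        x, _ = self.bilstm(x)                         # contextual visual features
        x, _ = self.self_attn(x, x, x)                # self-attention encoding
        h, _ = self.decoder(self.embed(command_tokens))
        ctx, _ = self.cross_attn(h, x, x)             # global attention over video
        return self.out(torch.cat([h, ctx], dim=-1))  # per-step token logits

# Usage with random placeholder data:
# logits = VideoToCommand()(torch.randn(2, 16, 1024), torch.randint(0, 300, (2, 6)))
```

In this arrangement, training would minimize cross-entropy between the predicted token logits and the reference command sequence; decoding the thesis's quaternion-form instructions would replace the token vocabulary with the appropriate output head.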
Keywords/Search Tags:video to command, fine-grained video captioning, deep learning, robot learning