
Robot Manipulation Command Generation Based On Video Stream

Posted on: 2022-09-03 | Degree: Master | Type: Thesis
Country: China | Candidate: X Y Mo | Full Text: PDF
GTID: 2518306539963139 | Subject: Software engineering
Abstract/Summary:
Having intelligent systems understand human movements from demonstration videos and learn new skills is a new trend in the design of self-learning robots. Humans can understand actions and imitate them directly by observing the behavior of others. However, enabling robots to perform actions based on observations of human activities remains a major challenge in robotics. As intelligent robots grow in popularity, there will be a growing demand for robotic systems that can understand human demonstrations and perform a variety of tasks. Therefore, in exploring how to raise the intelligence level of robot systems, building a framework that learns robot manipulation instructions from videos is of great significance.

In this paper, we propose a sequence-to-sequence robot instruction generation framework. The framework automatically generates instructions that can be used directly in robot applications by observing human demonstration videos without special markers. With this framework, a robot can understand human behavioral intent through visual information and thereby complete high-precision, complex tasks in daily life while avoiding tedious manual programming. More specifically, the framework consists of three steps. In the first step, we use Mask R-CNN to generate the manipulation area; the Two-Stream Inflated 3D ConvNet (I3D) is used to extract optical-flow features and RGB image features from the video, and these two feature streams are then fused. In the second step, a bidirectional LSTM is introduced to obtain contextual cues from the fused visual features. In the last step, two attention mechanisms, a self-attention mechanism and a global attention mechanism, are used to encode the visual and semantic features so as to learn the correlation between the video frame sequence and the instruction sequence. The sequence-to-sequence model finally outputs quaternions as instructions for the robot.

To verify how well the proposed framework lets robots learn manipulation knowledge from videos, extensive experiments are conducted on the publicly available Video-to-Command Dataset and the extended MPII Cooking Activities Dataset 2.0. The experimental results show that the method can effectively learn robot manipulation commands from human demonstration videos and achieves state-of-the-art performance. In addition, the output of the model is applied to a Baxter robot in a real environment, where the robot performs the manipulation tasks corresponding to the instructions learned by the model.
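To make the three-step pipeline concrete, the sketch below shows one possible PyTorch arrangement of the components named in the abstract: fused I3D visual features, a bidirectional LSTM encoder, self-attention over the visual sequence, and a decoder with global (cross-) attention that emits the command token sequence. This is a minimal illustrative sketch, not the thesis implementation; all module choices, dimensions, and the vocabulary size are assumptions, and the Mask R-CNN / I3D feature extraction is assumed to happen upstream.

```python
# Minimal sketch of the video-to-command sequence-to-sequence model.
# Assumptions: fused RGB + optical-flow clip features (e.g. from I3D) are
# precomputed; dimensions and vocabulary size are illustrative only.
import torch
import torch.nn as nn

class VideoToCommand(nn.Module):
    def __init__(self, visual_dim=1024, hidden_dim=512, vocab_size=300, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(visual_dim, hidden_dim)
        # Step 2: bidirectional LSTM gathers temporal context over clip features.
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)
        # Step 3: self-attention encodes the visual sequence; the decoder uses
        # global (cross-) attention over it while generating command tokens.
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.out = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, visual_feats, command_tokens):
        # visual_feats: (B, T, visual_dim) fused I3D features per clip
        # command_tokens: (B, L) token ids used for teacher forcing
        x = self.proj(visual_feats)
        x, _ = self.bilstm(x)                         # contextual visual features
        x, _ = self.self_attn(x, x, x)                # self-attention encoding
        h, _ = self.decoder(self.embed(command_tokens))
        ctx, _ = self.cross_attn(h, x, x)             # global attention over video
        return self.out(torch.cat([h, ctx], dim=-1))  # per-step token logits

# Usage with random placeholder data:
# logits = VideoToCommand()(torch.randn(2, 16, 1024), torch.randint(0, 300, (2, 6)))
```

In this arrangement, training would minimize cross-entropy between the predicted token logits and the reference command sequence; decoding the thesis's quaternion-form instructions would replace the token vocabulary with the appropriate output head.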
Keywords/Search Tags:video to command, fine-grained video captioning, deep learning, robot learning