首页 /研究 /This&That: Language-Gesture Controlled Video Generation for Robot Planning

LEARNING

This&That: Language-Gesture Controlled Video Generation for Robot Planning

Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park

发表年份: 2024
访问权限: 开放获取

摘要

Clear, interpretable instructions are invaluable when attempting any complex task. Good instructions help to clarify the task and even anticipate the steps needed to solve it. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intent, and 3) translating visual plans into robot actions. This&That uses language-gesture conditioning to generate video predictions, as a succinct and unambiguous alternative to existing language-only methods, especially in complex and uncertain environments. These video predictions are then fed into a behavior cloning architecture dubbed Diffusion Video to Action (DiVA), which outperforms prior state-of-the-art behavior cloning and video-based planning methods by substantial margins.

关键词

cs.ROcs.AIcs.CV

This&That: Language-Gesture Controlled Video Generation for Robot Planning

摘要

关键词

相关论文

The Organization of Behavior

Fractional Brownian Motions, Fractional Noises and Applications

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

A guide to deep learning in healthcare