MANIPULATION

Interactive Robot Action Replanning using Multimodal LLM Trained from Human Demonstration Videos

Chiori Hori, Motonari Kambara, Komei Sugiura, Kei Ota, Sameer Khurana, Siddarth Jain, Radu Corcodel, Devesh K. Jha, Diego Romeres, Jonathan Le Roux

Year of publication: 2025
Citations: 4

Abstract

Understanding human actions could allow robots to perform a wide range of complex manipulation tasks and make collaboration with humans easier. Recently, multimodal scene understanding using audio-visual Transformers has been used to generate robot action sequences from videos of human demonstrations. However, automatic action sequence generation is not always perfect because of the distribution gap between the training and test environments. Human intervention, such as telling the robot agent what should be done, could be very effective in bridging this gap. Motivated by this, we propose an error-correction-based action replanning approach that regenerates better action sequences using (1) automatically generated actions from a pre-trained action generator and (2) human error correction in natural language. We collected single-arm robot action sequences aligned to human action instructions for the cooking video dataset YouCook2. We trained the proposed error-correction-based action replanning model using a pre-trained multimodal LLM (AVBLIP-2), simultaneously generating a pair of (a) single-arm robot micro-step action sequences and (b) action descriptions in natural language. To assess the performance of error correction, we collected human feedback on correcting errors in the automatically generated robot actions. Experiments show that our proposed interactive replanning model, trained in a multitask manner using action sequences and descriptions, outperformed the baseline model on all types of scores.

Keywords

Computer science, Action (physics), Human–robot interaction, Action recognition, Robot, Artificial intelligence, Human–computer interaction, Computer vision
