SWARM

A tutorial note on collecting simulated data for vision-language-action models

Heran Wu, Zirun Zhou, Jingfeng Zhang

Publication year
2025
Access
Open access

Abstract

Traditional robotic systems typically decompose intelligence into independent modules for computer vision, natural language processing, and motion control. Vision-Language-Action (VLA) models replace this pipeline with a single neural network that processes visual observations, interprets human instructions, and directly outputs robot actions, all within a unified framework. However, these models depend heavily on high-quality training datasets that capture the complex relationships between visual observations, language instructions, and robot actions. This tutorial reviews three representative systems: the PyBullet simulation framework for flexible, customized data generation; the LIBERO benchmark suite for standardized task definition and evaluation; and the RT-X dataset collection for large-scale multi-robot data acquisition. We demonstrate dataset generation in PyBullet simulation and customized data collection within LIBERO, and we provide an overview of the characteristics and role of the RT-X dataset.

Keywords

cs.RO
