Colosseum V2: Benchmarking Generalization for Vision Language Action Models
Jeremy Morgan, Prajwal Vijay, Hyeonho Oh, Jincen Song, Ashvin Arora, Alina Du, Gaurav Sukhatme, Jesse Thomason, Ishika Singh
2026
Abstract
Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.
Keywords
Related papers
State-of-the-art in mobile robot-assisted grinding technologies for large-scale complex components
Yusen Li, Ziwei Wang, Xiangye Zhu +9 more
Robotics and Computer-Integrated Manufacturing · 2026
A fusion prediction model of tool wear based on physical information and machine learning in five-axis milling TC4 titanium alloy
Shaoqing Qin, Lida Zhu, Yanpeng Hao +7 more
Robotics and Computer-Integrated Manufacturing · 2026
Enhancing robotic milling quality via a novel piezoelectric active damping toolholder
Bo Li, Yuanbo Zhao, Huijie Xiao +3 more
Robotics and Computer-Integrated Manufacturing · 2026
A novel method of suppressing low-frequency chatter in robotic milling using magnetically-induced nonlinear broadband multidirectional passive vibration absorber
Hao Li, Yuhui Yu, Rui Fu +3 more
Robotics and Computer-Integrated Manufacturing · 2026