Vision-Language-Action (VLA) Models: Concepts, Progress, Applications and Challenges
Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, Manoj Karkee
- Year
- 2025
- Citations
- 2
- Access
- Open access
Abstract
Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as autonomous vehicles, medical and industrial robotics, precision agriculture, humanoid robotics, and augmented reality. We analyzed challenges and propose solutions including agentic adaptation and cross-embodiment planning. Furthermore, we outline a forward-looking roadmap where VLA models, VLMs, and agentic AI converge to strengthen socially aligned, adaptive, and general-purpose embodied agents. This work, is expected to serve as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence. The project repository is available on GitHub as https://github.com/Applied-AI-Research-Lab/Vision-Language-Action-Models-Concepts-Progress-Applications-and-Challenges. [Index Terms: Vision Language Action, VLA, Vision Language Models, VLMs, Action Tokenization, NLP]
Keywords
Related papers
Artificial intelligence: a modern approach
1995
Are we ready for autonomous driving? The KITTI vision benchmark suite
Andreas Geiger, P Lenz, R. Urtasun
2012
Self-Organizing Maps
Teuvo Kohonen
1995
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Martı́n Abadi, Ashish Agarwal, Paul Barham +17 more
2016