A Vision-language Model and Reinforcement Learning-based Zero-shot Action Generation Framework for an Autonomous Robot
Oualid Doukhi, Jae‐Ho Lee, Linfeng Wang, Deok Jin Lee
- Year: 2025
- Citations: 1
Abstract
This paper explores a novel methodology for zero-shot action generation using vision-language models. By leveraging bootstrapping language-image pre-training (BLIP) and reinforcement learning techniques such as proximal policy optimization (PPO), we establish an approach that maximizes the similarity between text-image pairs to determine optimal robotic actions. The methodology generalizes effectively from simulation to the real world on tasks such as human recognition and stair detection. In simulation, the robot detected a human with 0.86 similarity in 18.4 s and identified stairs with 0.87 similarity in 19.2 s; in real-world experiments, it achieved 0.88 similarity for human recognition in 20.3 s and 0.89 similarity for stair recognition in 21.1 s, confirming the robustness and adaptability of the framework in diverse environments.
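As a rough illustration of the idea in the abstract, the text-image similarity can serve as the reward signal that a policy-gradient method such as PPO maximizes. The sketch below is not the authors' implementation: it assumes the similarity is a cosine similarity between a text embedding and an image embedding, uses plain Python lists in place of real BLIP embeddings, and introduces a hypothetical `success_threshold` to decide when the target has been found.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_reward(
    text_emb: Sequence[float],
    image_emb: Sequence[float],
    success_threshold: float = 0.85,  # hypothetical value, not from the paper
) -> Tuple[float, bool]:
    """Return (reward, done) for one environment step.

    The reward is the text-image similarity itself, so an agent that
    maximizes return is pushed toward camera views matching the prompt.
    The episode is flagged as done once the similarity reaches the
    (assumed) success threshold.
    """
    sim = cosine_similarity(text_emb, image_emb)
    return sim, sim >= success_threshold


if __name__ == "__main__":
    # Toy vectors standing in for the embeddings of a text prompt
    # and of the current camera frame.
    prompt = [1.0, 0.0, 0.0]
    frame = [0.9, 0.1, 0.0]
    reward, done = similarity_reward(prompt, frame)
    print(f"reward={reward:.3f} done={done}")
```

In a full pipeline, the two vectors would come from the vision-language model's text and image encoders, and the `(reward, done)` pair would be returned from the environment's step function that the PPO agent interacts with.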