A Vision-language Model and Reinforcement Learning-based Zero-shot Action Generation Framework for an Autonomous Robot

Oualid Doukhi, Jae‐Ho Lee, Linfeng Wang, Deok Jin Lee

Year
2025
Citations
1

Abstract

This paper explores a novel methodology for zero-shot action generation using vision-language models. By leveraging bootstrapping language-image pre-training (BLIP) and reinforcement learning techniques such as proximal policy optimization (PPO), we establish an approach that maximizes the similarity between text-image pairs to determine optimal robotic actions. The methodology demonstrates effective generalization in simulation-to-real-world scenarios for tasks such as human recognition and stair detection. In simulations, the robot detected a human with 0.86 similarity in 18.4 s and identified stairs with 0.87 similarity in 19.2 s, whereas in real-world experiments it achieved 0.88 similarity for human recognition in 20.3 s and 0.89 similarity for stair recognition in 21.1 s, confirming the robustness and adaptability of the framework in diverse environments.
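The idea of scoring actions by text-image similarity can be illustrated with a minimal sketch. It is not the authors' implementation. The toy vectors stand in for the text and image embeddings a BLIP-style model would produce, the action names are hypothetical, and the greedy choice stands in for a policy that PPO would learn from the similarity signal used as a reward.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def best_action(text_embedding, image_embeddings_by_action):
    """Return the action whose resulting view best matches the text prompt.

    Greedy selection is an assumption made for illustration; in a
    reinforcement learning setup the similarity would serve as a reward.
    """
    scores = {
        action: cosine_similarity(text_embedding, embedding)
        for action, embedding in image_embeddings_by_action.items()
    }
    action = max(scores, key=scores.get)
    return action, scores[action]


# Hypothetical embedding of a text prompt such as "a person".
prompt_embedding = [1.0, 0.0, 1.0]

# Hypothetical embeddings of the camera view after each candidate action.
candidate_views = {
    "turn_left": [0.0, 1.0, 0.0],
    "move_forward": [1.0, 0.0, 0.5],
    "turn_right": [0.2, 0.9, 0.1],
}

action, reward = best_action(prompt_embedding, candidate_views)
print(action, round(reward, 2))
```

With these toy vectors, the view reached by moving forward is the closest match to the prompt, so that action receives the highest similarity score.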

Keywords

Reinforcement learning, Action (physics), Zero (linguistics), Artificial intelligence, Robot, Shot (pellet), Computer science, Computer vision, Linguistics, Physics
