HRI

Real-time open-vocabulary perception for mobile robots on edge devices: a systematic analysis of the accuracy-latency trade-off

Jongyoon Park, Pileun Kim, Daijin Ko

Year: 2025
Citations: 2
Access: Open access

Abstract

The integration of Vision-Language Models (VLMs) into autonomous systems is of growing importance for improving Human-Robot Interaction (HRI), enabling robots to operate within complex and unstructured environments and collaborate with non-expert users. For mobile robots to be effectively deployed in dynamic settings such as domestic or industrial areas, the ability to interpret and execute natural language commands is crucial. However, while VLMs offer powerful zero-shot, open-vocabulary recognition capabilities, their high computational cost presents a significant challenge for real-time performance on resource-constrained edge devices. This study provides a systematic analysis of the trade-offs involved in optimizing a real-time robotic perception pipeline on the NVIDIA Jetson AGX Orin 64GB platform. We investigate the relationship between accuracy and latency by evaluating combinations of two open-vocabulary detection models and two prompt-based segmentation models. Each pipeline is optimized using various precision levels (FP32, FP16, and Best) via NVIDIA TensorRT. We present a quantitative comparison of the mean Intersection over Union (mIoU) and latency for each configuration, offering practical insights and benchmarks for researchers and developers deploying these advanced models on embedded systems.
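The abstract's accuracy metric, mean Intersection over Union (mIoU), can be illustrated with a minimal sketch. The abstract does not state the evaluation protocol, so this assumes binary segmentation masks of equal shape, averaged per sample; the function names `iou` and `mean_iou` are hypothetical and not taken from the paper.

```python
from typing import Sequence

# A 2-D binary mask as nested sequences, where 1 marks an object pixel.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two same-shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample IoU over paired predictions and ground truths."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    if not scores:
        raise ValueError("mean_iou needs at least one mask pair")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predictions = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    ground_truth = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    print(mean_iou(predictions, ground_truth))
```

In the usage example the first pair overlaps on one of two labelled pixels (IoU 0.5) and the second pair matches exactly (IoU 1.0), so the mean is 0.75.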

Keywords

Pipeline (software), Intersection (aeronautics), Robot, Mobile robot, Enhanced Data Rates for GSM Evolution, Latency (audio), Perception, Segmentation
