Home /Research /ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems
PERCEPTION

ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems

Xinhai Sun, Xiang Shi, Menglin Zou, Wenlong Huang

Year
2026
Access
Open access

Abstract

The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Unlike directly downsampling the full frame, ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context. We present a reproducible pipeline covering calibration, synchronization, ROI generation, deterministic boundary handling, and metadata governance. The resulting representation is embodiment-aligned and viewpoint-normalized, enabling data reuse across heterogeneous robots. We argue that egocentric ROI serves as a practical data abstraction for scalable collection and cross-embodiment learning, bridging internet-scale perception and robot-specific control.

Keywords

cs.RO

Related papers

Browse all PERCEPTION papers