Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video

Zachary Chavis, Hyun Soo Park, Stephen J. Guy

Year: 2024
Access: Open access

Abstract

Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM that leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways: by understanding spatial task-affordances, i.e., where an agent must be for the task to physically take place, and by localizing that task relative to the egocentric viewer. We show that our approach outperforms the baseline of using a VLM to map the similarity of a task's description over a set of location-tagged images. Our approach has lower error both when predicting where a task may take place and when predicting which tasks are likely to happen at the current location. The resulting representation will enable robots to use egocentric sensing to navigate to, or around, physical regions of interest for novel tasks specified in natural language.
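To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of one way a task description could be matched against location-tagged images by embedding similarity. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 2-D toy embeddings, the task names, and the coordinates are all hypothetical, and in practice the embeddings would come from a VLM's text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def localize_task(task_embedding, tagged_images):
    """Return the location tag of the image most similar to the task text.

    tagged_images is a list of (image_embedding, (x, y)) pairs.
    """
    _, location = max(
        tagged_images,
        key=lambda item: cosine_similarity(task_embedding, item[0]),
    )
    return location


def rank_tasks_at(location, task_embeddings, tagged_images):
    """Rank task names by similarity to the image tagged nearest to location."""
    nearest_embedding, _ = min(
        tagged_images,
        key=lambda item: math.dist(location, item[1]),
    )
    return sorted(
        task_embeddings,
        key=lambda name: cosine_similarity(task_embeddings[name], nearest_embedding),
        reverse=True,
    )


# Hypothetical toy data standing in for VLM embeddings and map coordinates.
TAGGED_IMAGES = [
    ([1.0, 0.0], (0.0, 0.0)),
    ([0.0, 1.0], (5.0, 2.0)),
]
TASKS = {
    "wash dishes": [0.1, 0.9],
    "read a book": [0.9, 0.1],
}

where = localize_task(TASKS["wash dishes"], TAGGED_IMAGES)
likely = rank_tasks_at((0.2, 0.1), TASKS, TAGGED_IMAGES)
```

The two functions mirror the two evaluation directions named in the abstract: where a task may take place, and which tasks are likely at the current location. A baseline of this kind can only return locations where an image was captured, which is the limitation the spatial extension is meant to address.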

Keywords

cs.RO, cs.CV

