Home /Research /LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

PERCEPTION

LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

Yihao Wang, Raphael Memmesheimer, Sven Behnke

Year: 2025
Access: Open access

Abstract

The availability of large language models and open-vocabulary object perception methods enables more flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM - an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.

Keywords

cs.CVcs.AIcs.RO

LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

Abstract

Keywords

Related papers

Artificial intelligence: a modern approach

Are we ready for autonomous driving? The KITTI vision benchmark suite

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Vision meets robotics: The KITTI dataset