PERCEPTION

Augmenting Open Vocabulary Object Detection with Large Language Models for Home Service Robots*

Eric Martinson, H. Indurthi

Year: 2025
Citations: 1

Abstract

Practical service robots struggle with traditional object detection methods. Extensive hand-labeling requirements limit the set of objects a robot can detect, which in turn significantly reduces its utility. New open-vocabulary detection models like CLIP and YOLO-World can find objects associated with arbitrary search queries without requiring additional hand-labeled data. Unfortunately, these methods are significantly less accurate. Multimodal large language models have been suggested as viable alternatives, but their computational load generally requires uploading significant quantities of images to the cloud for processing, which is both a privacy risk for home deployments and expensive in general. This work proposes an alternative way of integrating with LLMs. Without uploading any images to the cloud, we demonstrate how a text-only LLM can support open-vocabulary object detection, integrating knowledge of object size, room type, and alternative text queries to significantly improve F-score and specificity. The approach is verified against the ScanNet dataset and further demonstrated to work in real time on the Stretch 2 mobile robot.
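
The abstract does not spell out how the LLM's knowledge is combined with the detector, so the following is only an illustrative sketch of one way text-only priors could filter open-vocabulary detections, not the authors' method. The `Detection` class, the `SIZE_RANGE_M` and `ROOM_PRIOR` tables, the multiplicative rescoring rule, and the threshold are all hypothetical; in practice the priors would come from text-only LLM queries rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str     # text query that produced this detection
    score: float   # detector confidence in [0, 1]
    size_m: float  # estimated largest object dimension in metres (assumed available, e.g. from depth)


# Hypothetical priors of the kind a text-only LLM could be asked for (no images involved).
SIZE_RANGE_M = {"mug": (0.05, 0.20), "sofa": (1.2, 3.0)}
ROOM_PRIOR = {
    ("mug", "kitchen"): 0.9,
    ("mug", "bathroom"): 0.2,
    ("sofa", "living room"): 0.9,
    ("sofa", "kitchen"): 0.05,
}


def rescore(det: Detection, room: str, default_prior: float = 0.5) -> float:
    """Return 0 for implausibly sized detections, else weight the score by the room prior."""
    lo, hi = SIZE_RANGE_M.get(det.label, (0.0, float("inf")))
    if not (lo <= det.size_m <= hi):
        return 0.0
    return det.score * ROOM_PRIOR.get((det.label, room), default_prior)


def filter_detections(dets: list[Detection], room: str, threshold: float = 0.3) -> list[Detection]:
    """Keep only detections whose rescored confidence reaches the threshold."""
    return [d for d in dets if rescore(d, room) >= threshold]


detections = [
    Detection("mug", 0.8, 0.10),   # plausible size
    Detection("mug", 0.9, 1.50),   # far too large to be a mug
    Detection("sofa", 0.7, 2.00),  # plausible size, but unlikely in a kitchen
]
kept_in_kitchen = filter_detections(detections, "kitchen")
```

Under these assumed priors, only the first mug survives in a kitchen: the oversized "mug" is rejected on size, and the sofa is suppressed by the room prior. Rejecting false positives in this way is the kind of effect that would raise specificity.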

Keywords

Upload, Vocabulary, Object (grammar), Object detection, Set (abstract data type), Service (business), Robot
