PERCEPTION

Augmenting Open Vocabulary Object Detection with Large Language Models for Home Service Robots

Eric Martinson, H. Indurthi

Year: 2025
Citations: 1

Abstract

Practical service robots struggle with traditional object detection methods. Because these methods require extensive hand labeling, robots can detect only a limited set of objects, which in turn significantly reduces their utility. Newer open-vocabulary detection models such as CLIP and YOLO-World can find objects associated with arbitrary search queries without requiring additional hand-labeled data. Unfortunately, these methods are significantly less accurate. Multimodal large language models have been suggested as viable alternatives, but their computational load generally requires uploading large quantities of images to the cloud for processing, which is both a privacy risk for home deployments and expensive in general. This work proposes an alternative way of integrating with LLMs. Without uploading any images to the cloud, we demonstrate how a text-only LLM can support open-vocabulary object detection, integrating knowledge of object size, room type, and alternative text queries to significantly improve F-score and specificity. The approach is verified against the ScanNet dataset and further demonstrated to work in real time on the Stretch 2 mobile robot.
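As a rough illustration of the idea only, and not the authors' implementation, the sketch below shows one way text-only priors about object size and room type could be used to reject implausible open-vocabulary detections. Everything in it is an assumption made for illustration: the `Detection` and `Prior` types, the `PRIORS` table (hardcoded here as a stand-in for answers a text-only LLM might give), the score threshold, and the hard accept/reject rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One open-vocabulary detection (hypothetical structure)."""
    query: str     # text query that produced the detection
    score: float   # detector confidence in [0, 1]
    size_m: float  # estimated largest dimension in metres, e.g. from depth
    room: str      # room type the robot believes it is in


@dataclass(frozen=True)
class Prior:
    """Plausibility limits for one object class (hypothetical structure)."""
    min_size_m: float
    max_size_m: float
    likely_rooms: frozenset


# Assumption: these values stand in for answers from a text-only LLM.
# They are illustrative numbers, not figures from the paper.
PRIORS = {
    "mug": Prior(0.05, 0.20, frozenset({"kitchen", "office", "living room"})),
}


def is_plausible(det, priors, min_score=0.3):
    """Return True if the detection passes the score, size and room checks."""
    if det.score < min_score:
        return False
    prior = priors.get(det.query)
    if prior is None:
        # No prior available for this query: fall back to the score alone.
        return True
    if not (prior.min_size_m <= det.size_m <= prior.max_size_m):
        return False
    return det.room in prior.likely_rooms


def filter_detections(detections, priors, min_score=0.3):
    """Keep only detections consistent with the priors."""
    return [d for d in detections if is_plausible(d, priors, min_score)]


if __name__ == "__main__":
    candidates = [
        Detection("mug", 0.8, 0.10, "kitchen"),   # plausible
        Detection("mug", 0.9, 1.50, "kitchen"),   # far too large
        Detection("mug", 0.7, 0.10, "bathroom"),  # unlikely room
        Detection("mug", 0.1, 0.10, "kitchen"),   # low confidence
    ]
    for det in filter_detections(candidates, PRIORS):
        print(det)
```

Only image-free information (query text, a size estimate, a room label) would need to reach the language model in a scheme like this, which is the privacy property the abstract emphasizes. A hard room filter is a deliberately simple choice; a softer variant could down-weight the score instead, and the alternative text queries mentioned in the abstract are not covered by this sketch.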

Keywords

Upload, Vocabulary, Object (grammar), Object detection, Set (abstract data type), Service (business), Robot
