Augmenting Open Vocabulary Object Detection with Large Language Models for Home Service Robots
Eric Martinson, H. Indurthi
- Year: 2025
- Citations: 1
Abstract
Practical service robots struggle with traditional object detection methods. Extensive hand-labeling requirements limit the set of objects a robot can detect, which in turn significantly reduces its utility. New open-vocabulary detection models like CLIP and YOLO-World can find objects associated with arbitrary search queries without requiring additional hand-labeled data. Unfortunately, these methods are significantly less accurate. Multimodal large language models have been suggested as viable alternatives, but their computational load generally requires uploading significant quantities of images to the cloud for processing, which is both a privacy risk for home deployments and expensive in general. This work proposes an alternative way of integrating with LLMs. Without uploading any images to the cloud, we demonstrate how a text-only LLM can support open-vocabulary object detection, integrating knowledge of object size, room type, and alternative text queries to significantly improve F-score and specificity. The approach is verified against the ScanNet dataset and further demonstrated to work in real time on the Stretch 2 mobile robot.
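As a rough illustration of the idea in the abstract, the sketch below filters open-vocabulary detections using text-only priors on object size and room type. It is not the authors' implementation: the `Detection` fields, the `SIZE_RANGE_M` and `LIKELY_ROOMS` tables, the threshold, and all numeric values are hypothetical stand-ins for answers that a text-only LLM might be asked to supply once per search query, with no images leaving the robot.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str      # text query that produced this detection
    score: float    # detector confidence in [0, 1]
    size_m: float   # estimated largest object dimension in metres (e.g. from depth)
    room: str       # room type in which the detection was made


# Hypothetical priors of the kind a text-only LLM could provide per query.
# The values are illustrative assumptions, not results from the paper.
SIZE_RANGE_M = {"mug": (0.05, 0.20), "sofa": (1.2, 3.0)}
LIKELY_ROOMS = {"mug": {"kitchen", "office", "living room"}, "sofa": {"living room"}}


def is_plausible(det: Detection, score_threshold: float = 0.3) -> bool:
    """Keep a detection only if confidence, size, and room type all look plausible."""
    if det.score < score_threshold:
        return False
    low, high = SIZE_RANGE_M.get(det.label, (0.0, float("inf")))
    if not (low <= det.size_m <= high):
        return False
    rooms = LIKELY_ROOMS.get(det.label)
    # Labels without a room prior are not filtered by room.
    return rooms is None or det.room in rooms


def filter_detections(dets, score_threshold: float = 0.3):
    """Return the detections that pass every plausibility check, in input order."""
    return [d for d in dets if is_plausible(d, score_threshold)]
```

Under these assumptions, a "mug" measured at 1.5 m or a "sofa" reported in a bathroom would be rejected as a likely false positive, which is one way such priors could raise specificity without any cloud image processing.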