Robot Instance Segmentation with Few Annotations for Grasping

Moshe Kimhi, David Vainshtein, Chaim Baskin, Dotan Di Castro

Year: 2025
Citations: 4

Abstract

The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, such as traffic, navigation, and object grasping, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and to leverage visual consistency despite temporal gaps, without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an AP50 of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an AP50 score of 84.89 with just 1% of annotated data, compared to the previous state of the art of 82, which targeted the fully annotated dataset.
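
For readers unfamiliar with the metric, AP50 is average precision computed with an intersection-over-union (IoU) threshold of 0.5 between predicted and ground-truth instance masks. The following is a minimal sketch of that matching criterion only, not the authors' evaluation code; the function names `mask_iou` and `is_true_positive` and the toy masks are assumptions made for illustration.

```python
def mask_iou(pred, truth):
    """IoU of two binary masks given as equal-sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks have no overlap to measure; report 0.0 by convention here.
    return inter / union if union else 0.0


def is_true_positive(pred, truth, threshold=0.5):
    """Under AP50, a prediction matches a ground-truth mask when IoU >= 0.5."""
    return mask_iou(pred, truth) >= threshold


# Toy example (hypothetical data): the prediction covers the object plus one extra pixel.
truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
pred = [[1, 1, 1],
        [1, 1, 0],
        [0, 0, 0]]

print(mask_iou(pred, truth))          # intersection 4, union 5
print(is_true_positive(pred, truth))
```

A full AP50 computation would additionally rank predictions by confidence and integrate the precision-recall curve; the sketch covers only the per-instance matching step.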

Keywords

Computer science, Artificial intelligence, Robot, Computer vision, Segmentation, Human–computer interaction
