Home /Research /ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics
MANIPULATION

ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics

Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani, Denis Ruffino, David Naso, Paolo Roberto Massenio

Year
2025
Access
Open access

Abstract

Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We proposte two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenario. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.

Keywords

cs.ROcs.CV

Related papers

Browse all MANIPULATION papers