MANIPULATION

Vote-based multimodal fusion for hand-held object pose estimation

Dinh-Cuong Hoang, Phan Xuan Tan, Anh-Nhat Nguyen, Duc-Long Pham, Van-Duc Vu, Van-Thiep Nguyen, Thu-Uyen Nguyen, Duc-Thanh Tran, Khanh-Toan Phan, Xuan-Tung Dinh, Van-Hiep Duong, Ngoc-Trung Ho, Hai-Nam Pham, Viet-Anh Trinh, Son-Anh Bui

Year
2025
Citations
1

Abstract

Estimating the pose of hand-held objects is a critical and challenging task in computer vision and robotics, with applications in robotic manipulation, human–robot interaction, and augmented reality (AR). Leveraging multimodal data, such as color (RGB) and depth (D) images, provides a promising avenue for addressing these challenges. However, existing approaches face two significant limitations. First, hand-induced occlusions often obscure critical object features, limiting the accuracy of conventional pose estimation methods. Second, most current techniques extract features from separate backbones and fuse them at the feature level, which can lead to representation distribution shifts and performance disruptions during fine-tuning due to dense interactions between RGB and depth branches. In this work, we propose a novel deep neural network for hand-held object pose estimation using RGB-D images as input. Our approach introduces a vote-based fusion mechanism that dynamically integrates multimodal data, effectively addressing occlusions and representation misalignments. Additionally, we incorporate hand-object keypoint interactions through a specialized module, enabling more accurate pose estimation in complex scenarios. Experiments on three public datasets demonstrate significant improvements in accuracy and robustness, with accuracy gains of up to 15% over state-of-the-art methods. Furthermore, on-site experimental verification highlights the practicality of our framework, achieving an average precision of 76.8% and outperforming existing methods by margins of up to 13.9%. The proposed method also achieves competitive inference times of 40 ms without refinement and 200 ms with refinement, demonstrating its suitability for real-world applications.

Keywords

Artificial intelligence, Fusion, Object (grammar), Pose, Computer science, Computer vision, Estimation, Sensor fusion, Pattern recognition (psychology), Engineering
