Enhancing Spatial Awareness via Multi-Modal Fusion of CNN-Based Visual and Depth Features
Babar Hussain, Jiandong Guo, Fareed Sidra, Luyao Chen
- Year: 2025
- Citations: 1
- Access: Open access
Abstract
Accurate spatial awareness is a fundamental requirement for intelligent vision systems that operate in complex, dynamic environments, in applications such as autonomous navigation, robotic manipulation, and augmented reality. While Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in tasks such as image classification and semantic segmentation, their inherently two-dimensional structure limits their ability to model and reason about three-dimensional spatial relationships. Specifically, CNNs are constrained by local receptive fields, a lack of explicit geometric context, and a dependence on appearance-based cues, which often results in an inaccurate understanding of object boundaries, depth discontinuities, and occlusions in real-world scenes. To address these limitations, this paper investigates the fusion of RGB visual data with depth information through a multi-modal intermediate fusion framework. We propose a lightweight experimental prototype that runs parallel feature extraction pipelines on RGB images and their corresponding depth maps, then fuses the two at the feature level to enhance semantic and geometric understanding. The experiment is conducted on the NYU Depth V2 dataset, which provides densely labeled indoor scenes with aligned RGB and depth data. A baseline CNN model trained solely on RGB input is compared against a modified model that uses intermediate fusion of RGB and depth features. Experimental results indicate that the inclusion of depth information significantly improves the model’s ability to delineate object boundaries, resolve foreground-background ambiguities, and maintain semantic coherence across varying spatial scales. The depth-enhanced model also demonstrates increased robustness to occlusions and illumination changes, highlighting the practical benefits of integrating geometric cues into visual perception pipelines.
These findings provide empirical support for the theoretical premise that multi-modal feature fusion can substantially enhance spatial reasoning in CNN-based architectures. This study contributes both a conceptual understanding and an applied perspective on the design of multi-modal spatial systems. The results serve as a foundation for further development of robust, depth-aware visual perception models with applications in real-time robotics, autonomous systems, and immersive AR/VR environments.
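The intermediate fusion idea described in the abstract (two parallel feature extractors, one for RGB and one for depth, joined at the feature level) can be illustrated with a small sketch. This is not the authors' prototype: the function names, layer sizes, and random weights are all hypothetical, and it is written in plain Python with nested lists only so that it runs without dependencies. It assumes the depth map is already aligned to the RGB image, as the abstract states for NYU Depth V2.

```python
import random

def conv3x3(image, kernels):
    """Valid 3x3 convolution followed by ReLU.
    image: [C][H][W], kernels: [K][C][3][3] -> output: [K][H-2][W-2]."""
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    out = []
    for kernel in kernels:
        fmap = []
        for i in range(h - 2):
            row = []
            for j in range(w - 2):
                s = 0.0
                for c in range(c_in):
                    for di in range(3):
                        for dj in range(3):
                            s += image[c][i + di][j + dj] * kernel[c][di][dj]
                row.append(max(0.0, s))  # ReLU
            fmap.append(row)
        out.append(fmap)
    return out

def concat_channels(a, b):
    """Feature-level fusion step: stack two feature maps along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0]), "spatial sizes must match"
    return a + b

def conv1x1(features, weights):
    """1x1 convolution that mixes channels. weights: [K][C]."""
    h, w = len(features[0]), len(features[0][0])
    return [[[sum(wk[c] * features[c][i][j] for c in range(len(features)))
              for j in range(w)] for i in range(h)] for wk in weights]

def fuse_rgbd(rgb, depth, rgb_kernels, depth_kernels, fuse_weights):
    """Hypothetical intermediate fusion: separate branches, then a joint mixing layer."""
    rgb_feat = conv3x3(rgb, rgb_kernels)        # appearance branch
    depth_feat = conv3x3(depth, depth_kernels)  # geometry branch
    fused = concat_channels(rgb_feat, depth_feat)
    return conv1x1(fused, fuse_weights)

def random_kernels(rng, k, c):
    return [[[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
             for _ in range(c)] for _ in range(k)]

# Toy inputs: an 8x8 RGB image (3 channels) and an aligned 8x8 depth map (1 channel).
rng = random.Random(0)
rgb = [[[rng.random() for _ in range(8)] for _ in range(8)] for _ in range(3)]
depth = [[[rng.random() for _ in range(8)] for _ in range(8)]]

rgb_kernels = random_kernels(rng, 4, 3)    # 4 RGB feature channels (arbitrary choice)
depth_kernels = random_kernels(rng, 2, 1)  # 2 depth feature channels (arbitrary choice)
fuse_weights = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(5)]  # 6 -> 5 channels

out = fuse_rgbd(rgb, depth, rgb_kernels, depth_kernels, fuse_weights)
print(len(out), len(out[0]), len(out[0][0]))  # 5 6 6
```

The RGB-only baseline in this sketch would correspond to passing `rgb_feat` alone to the mixing layer. Fusing after each branch has extracted its own features, rather than stacking depth as a fourth input channel, lets each modality be filtered separately before the two are combined.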