A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation
Xin Zhang, Liangxiu Han, Sergio Davies, Tam Sobeih, Lianghao Han, Darren Dancey
- Publication year: 2025
- Citations: 2
Abstract
Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labelled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49 % reduction) and Square Relative Error (39.77 % reduction) compared to existing models. The SDT also achieves a 70.2 % reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4 % (20.55 M vs. 35.68 M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications. 
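The error metrics and the efficiency figures quoted in the abstract can be illustrated with a short sketch. The definitions of Absolute Relative Error and Square Relative Error below follow the convention commonly used in depth estimation benchmarks; the abstract itself does not spell them out, so treat them as an assumption. The function names and the toy depth values are hypothetical. Only the energy and parameter figures come from the abstract.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid (gt > 0) pixels (assumed definition)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def sq_rel(pred, gt):
    """Mean of (pred - gt)^2 / gt over valid (gt > 0) pixels (assumed definition)."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


def relative_reduction(ours, baseline):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Figures quoted in the abstract.
energy_reduction = relative_reduction(12.43, 41.77)  # mJ per inference
param_reduction = relative_reduction(20.55, 35.68)   # millions of parameters

print(f"energy: {energy_reduction:.1f}%  parameters: {param_reduction:.1f}%")

# Toy depth values (hypothetical, in metres) to exercise the metrics.
toy_pred = [2.0, 3.0]
toy_gt = [1.0, 5.0]
print(abs_rel(toy_pred, toy_gt), sq_rel(toy_pred, toy_gt))
```

Recomputing the reductions this way reproduces the 70.2 % and 42.4 % figures stated above from the raw energy and parameter counts.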
• Novel energy-efficient Spike Transformer for depth estimation from event cameras.
• Purely spike-driven transformer with spike-based attention and residual mechanisms.
• Fusion depth head combines multi-stage features for fine-grained prediction.
• Cross-modality knowledge distillation from DINOv2 enhances SNN training.
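To give a concrete sense of what "spike-driven" computation means, here is a minimal sketch of a leaky integrate-and-fire (LIF) neuron, a textbook building block of spiking neural networks. It is a generic illustration, not the neuron model of the SDT: the summary above does not specify one, and the function name, the parameters `tau`, `v_threshold` and `v_reset`, and the input values are all assumptions.

```python
def lif_neuron(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one generic LIF neuron over a sequence of input currents.

    Returns a list of binary spikes (0 or 1), one per time step.
    """
    v = v_reset  # membrane potential
    spikes = []
    for x in inputs:
        # Leaky integration: the potential relaxes toward the input current.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # fire a binary spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)
    return spikes


# A constant input drives the neuron to fire on every second step.
print(lif_neuron([1.5, 1.5, 1.5, 1.5]))  # [0, 1, 0, 1]
```

Because the output is binary, downstream layers only need to accumulate weights where a spike occurred, which is the usual argument for the low energy cost of spiking networks.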