CMTNet: a transformer-based network for LiDAR-camera cross-modal calibration
Xianming Lang, Qiang Liu, Jiu‐Yu Ji
- Year: 2025
- Citations: 2
Abstract
Autonomous vehicles and robots operate in dynamic environments that include complex urban streets, moving obstacles, and difficult sensing conditions, all of which make perception more challenging. A single type of sensor cannot meet the needs of target detection. Multimodal sensor fusion, which combines the LiDAR and camera modalities, provides complementary 2D semantic and 3D geometric information, but its performance depends critically on precise extrinsic calibration between the sensors. We propose CMTNet, a novel cross-modal transformer architecture for robust estimation of extrinsic parameters. The method uses depth maps as a unified representation of images and LiDAR point clouds. We use a ResNet-18 network to extract relative depth and semantic features from the monocular depth map, and we extract precise 3D geometric features from the point cloud depth map. A correlation layer then fuses the two sets of features. Finally, the transformer estimates the calibration parameters from the fused multimodal features. We evaluated our method on the KITTI raw dataset, where it outperformed other methods. In addition, extensive experiments on KITTI odometry demonstrated that our method generalizes well.
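To illustrate the depth-map representation mentioned in the abstract, the following is a minimal sketch of how a LiDAR point cloud might be projected into a sparse depth map given candidate extrinsic parameters. It is not the authors' implementation: the function name `project_to_depth_map`, its parameters, the zero-skew pinhole camera model, and the toy values are all assumptions made for illustration, and plain Python lists are used only to keep the example self-contained.

```python
import math


def project_to_depth_map(points, R, t, fx, fy, cx, cy, width, height):
    """Project LiDAR points into a sparse depth map (hypothetical helper).

    points: iterable of (x, y, z) in the LiDAR frame.
    R, t:   assumed LiDAR-to-camera extrinsics (3x3 rotation, 3-vector translation).
    fx, fy, cx, cy: assumed pinhole intrinsics with zero skew.
    Pixels that receive no LiDAR return keep the value 0.0.
    """
    depth = [[0.0] * width for _ in range(height)]
    for p in points:
        # Move the point into the camera frame: p_cam = R * p + t.
        xc, yc, zc = (
            sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)
        )
        if zc <= 0.0:
            continue  # the point lies behind the camera
        # Pinhole projection to integer pixel coordinates.
        u = math.floor(fx * xc / zc + cx)
        v = math.floor(fy * yc / zc + cy)
        if 0 <= u < width and 0 <= v < height:
            # When several points hit the same pixel, keep the nearest one.
            if depth[v][u] == 0.0 or zc < depth[v][u]:
                depth[v][u] = zc
    return depth


# Toy data: five points and a 5x5 image (illustrative values only).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [
    (0.0, 0.0, 10.0),
    (0.0, 0.0, 4.0),
    (1.0, 0.0, 10.0),
    (0.0, 0.0, -5.0),    # behind the camera, discarded
    (100.0, 0.0, 10.0),  # projects outside the image, discarded
]

# Depth map under the assumed correct extrinsics.
aligned = project_to_depth_map(
    points, identity, [0.0, 0.0, 0.0], 10.0, 10.0, 2.0, 2.0, 5, 5
)

# Depth map under a 1 m translation error along the camera x-axis.
shifted = project_to_depth_map(
    points, identity, [1.0, 0.0, 0.0], 10.0, 10.0, 2.0, 2.0, 5, 5
)
```

Comparing `aligned` with `shifted` shows that an error in the extrinsic translation moves the projected depths to different pixels. This misalignment between the LiDAR depth map and the image is the kind of discrepancy a calibration network has to resolve.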