PERCEPTION

CMTNet: a transformer-based network for LiDAR-camera cross-modal calibration

Xianming Lang, Qiang Liu, Jiu‐Yu Ji

Year: 2025
Citations: 2

Abstract

Autonomous vehicles and robots operate in dynamic environments that include complex urban streets, dynamic obstacles, and complex sensing conditions, all of which make the perception task more challenging. A single type of sensor cannot meet the needs of target detection on its own. Multimodal sensor fusion, which combines the LiDAR and camera modalities, provides complementary 2D semantic and 3D geometric information. Its performance depends critically on precise extrinsic calibration between the sensors. We propose CMTNet, a novel cross-modal transformer architecture for robust extrinsic parameter estimation. The method uses depth maps as a unified representation of images and LiDAR point clouds. We use a ResNet-18 network to extract relative depth and semantic features from the monocular depth map, and we extract precise 3D geometric features from the point cloud depth map. A correlation layer then fuses the two sets of features. Finally, the transformer estimates the calibration parameters from the fused multimodal features. We evaluated our method on the KITTI raw dataset, where it outperformed other methods. In addition, extensive experiments on KITTI odometry demonstrated that our method generalizes well.
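To illustrate the "point cloud depth map" mentioned in the abstract, the following is a minimal sketch of how LiDAR points might be projected into a sparse depth image given a candidate extrinsic transform. It is not the authors' implementation. The function name `lidar_to_depth_map`, the argument layout, the zero-skew pinhole camera model, and the choice to keep the nearest point per pixel are all assumptions made for illustration.

```python
import math


def lidar_to_depth_map(points, T_cam_lidar, K, height, width):
    """Project LiDAR points into a sparse depth map (hypothetical helper).

    points       -- iterable of (x, y, z) in the LiDAR frame
    T_cam_lidar  -- 4x4 extrinsic matrix (LiDAR frame -> camera frame)
    K            -- 3x3 pinhole intrinsic matrix, zero skew assumed
    Returns a height x width list of lists; 0.0 marks pixels with no return.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    depth = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        # Rigid transform of the point into the camera frame.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in T_cam_lidar[:3]
        )
        if zc <= 0:
            continue  # behind the image plane, not visible

        # Pinhole projection to pixel coordinates.
        u = int(math.floor(fx * xc / zc + cx))
        v = int(math.floor(fy * yc / zc + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # falls outside the image

        # Assumption: when several points hit one pixel, keep the nearest.
        if depth[v][u] == 0.0 or zc < depth[v][u]:
            depth[v][u] = zc

    return depth
```

With an inaccurate `T_cam_lidar`, the projected depth map is misaligned with the depth map predicted from the image. That misalignment is the kind of signal a calibration network can learn to regress a correction from.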

Keywords

Transformer, Modal, Lidar, Calibration, Computer science, Remote sensing, Electrical engineering, Geology, Voltage, Mathematics
