PERCEPTION

Incremental Joint Learning of Depth, Pose, and Implicit Scene Representation on Monocular Camera in Large-Scale Scenes

Tianchen Deng, Nailin Wang, Chongdi Wang, Shenghai Yuan, Jingchuan Wang, Hesheng Wang, Danwei Wang, Weidong Chen

Publication year: 2025
Citations: 5

Abstract

Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR and robot navigation. Existing dense reconstruction methods are primarily designed for small, room-scale scenarios, but in practice the scenes encountered by robots are typically large-scale environments. Most existing methods struggle in large-scale scenes because of three core challenges: (a) inaccurate depth input. Depth information is crucial for both scene geometry reconstruction and pose estimation, yet accurate depth input is impossible to obtain in real-world large-scale scenes. (b) inaccurate pose estimation. Existing methods are not robust to the cumulative errors that grow in large scenes and long sequences. (c) insufficient scene representation capability. A single global radiance field lacks the capacity to scale effectively to large-scale scenes. To this end, we propose an incremental joint learning framework that achieves accurate depth estimation, accurate pose estimation, and large-scale dense scene reconstruction. For depth estimation, a vision transformer-based network is adopted as the backbone to improve the estimation of scale information. For pose estimation, a feature-metric bundle adjustment (FBA) method is designed for accurate and robust camera tracking in large-scale scenes and to eliminate pose drift. For implicit scene representation, we propose an incremental method that constructs the entire large-scale scene as multiple local radiance fields, which improves the scalability of the 3D scene representation. Within each local radiance field, we propose a tri-plane based scene representation method to further improve the accuracy and efficiency of scene reconstruction. We conduct extensive experiments on various datasets, including our own collected data, to demonstrate the effectiveness and accuracy of our method in depth estimation, pose estimation, and large-scale scene reconstruction. The code has been open-sourced at https://github.com/dtc111111/incre-dpsr.
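As a rough illustration of what a tri-plane representation involves, the sketch below queries a feature for 3D points from three axis-aligned 2D feature grids. It is not taken from the authors' code: the function names `bilinear_sample` and `query_triplane`, the summation of the three plane features, and the assumption that points are normalized to [-1, 1]^3 are all illustrative choices, and the paper's actual design may differ.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly interpolate a (R, R, C) feature plane at uv in [-1, 1]^2.

    Hypothetical helper; assumes a square plane with R >= 2.
    """
    res = plane.shape[0]
    # Map [-1, 1] to continuous grid coordinates in [0, res - 1].
    grid = (np.clip(uv, -1.0, 1.0) + 1.0) * 0.5 * (res - 1)
    # Lower corner of the enclosing cell, kept inside the grid.
    i0 = np.clip(np.floor(grid[:, 0]).astype(int), 0, res - 2)
    j0 = np.clip(np.floor(grid[:, 1]).astype(int), 0, res - 2)
    # Fractional offsets inside the cell, shaped (N, 1) for broadcasting.
    wi = (grid[:, 0] - i0)[:, None]
    wj = (grid[:, 1] - j0)[:, None]
    f00 = plane[i0, j0]
    f10 = plane[i0 + 1, j0]
    f01 = plane[i0, j0 + 1]
    f11 = plane[i0 + 1, j0 + 1]
    return (f00 * (1 - wi) * (1 - wj) + f10 * wi * (1 - wj)
            + f01 * (1 - wi) * wj + f11 * wi * wj)

def query_triplane(planes, points):
    """Return an (N, C) feature per point by summing XY, XZ and YZ samples.

    Summation is an assumed aggregation; concatenation is another common choice.
    """
    plane_xy, plane_xz, plane_yz = planes
    return (bilinear_sample(plane_xy, points[:, [0, 1]])
            + bilinear_sample(plane_xz, points[:, [0, 2]])
            + bilinear_sample(plane_yz, points[:, [1, 2]]))
```

In a full pipeline the resulting feature would typically be passed to a small decoder that predicts density and color. Under the incremental scheme described in the abstract, each local radiance field would hold its own set of planes.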

Keywords

Radiance, Representation (politics), Bundle adjustment, Monocular, Scalability, Robot, Scale (ratio), Robotics, Segmentation, Pose
