
Human Tide, Clear Sight: Semantically Enhanced Visual Localization in High-Crowd Scenarios

Yida Wei, Sikang Liu, Zixuan Huang, Wei He, You Li

Year
2025
Citations
1

Abstract

Accurate visual localization is essential in IoT applications, particularly for robotics, autonomous systems, and augmented reality. Traditional feature-based methods struggle with efficiency and robustness against environmental variations. To improve robustness against these variations, state-of-the-art (SOTA) methods incorporate semantic information into their models, but they still have several shortcomings. They often embed semantic information implicitly, which limits their extensibility and interpretability. Moreover, introducing unstable semantic labels can degrade localization accuracy rather than improve it. Modularity, quantification, and filtering of semantic labels by their stability therefore become critical. To address these gaps, this article proposes a method that explicitly and quantitatively integrates semantic information through a plug-and-play module. The module scores image-to-image and feature-to-feature correspondences based on semantic similarity and stability, with a particular focus on improving smartphone-based visual localization in high-crowd indoor scenarios. It is introduced into two key stages of hierarchical visual localization: 1) visual place recognition (coarse localization) and 2) 6-degree-of-freedom pose estimation (fine localization). Correspondences with low scores imply a higher probability of matching errors and are therefore suppressed. To validate the approach, a novel dataset designed for semantic visual localization tasks is collected, rich in dynamic objects and scene variations. The method demonstrates superior accuracy and robustness, particularly in environments with significant scene appearance changes, improving localization accuracy by 13.6% and 5.4% on the Cafds and Libds datasets, respectively, compared with the SOTA approach.
The code and dataset are available at https://github.com/1da1da/SEVL.
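
The abstract describes scoring correspondences by semantic similarity and stability and suppressing the low-scoring ones. The following is a minimal sketch of that general idea for feature-to-feature matches, not the authors' implementation: the `STABILITY` table, its values, the `Match` type, the multiplicative score, and the threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-class stability weights in [0, 1]. The values are
# illustrative assumptions: static structure is high, people are zero.
STABILITY = {"wall": 1.0, "floor": 0.9, "chair": 0.5, "person": 0.0}


@dataclass(frozen=True)
class Match:
    """One feature-to-feature correspondence (hypothetical structure)."""
    query_label: str         # semantic label at the query keypoint
    db_label: str            # semantic label at the database keypoint
    descriptor_score: float  # similarity from any feature matcher, in [0, 1]


def semantic_score(match):
    """Return the stability of the shared label, or 0.0 if the labels differ."""
    if match.query_label != match.db_label:
        return 0.0
    # Unknown labels are treated as unstable in this sketch.
    return STABILITY.get(match.query_label, 0.0)


def filter_matches(matches, threshold=0.4):
    """Keep matches whose combined score reaches the (assumed) threshold."""
    kept = []
    for m in matches:
        score = m.descriptor_score * semantic_score(m)
        if score >= threshold:
            kept.append((m, score))
    return kept


matches = [
    Match("wall", "wall", 0.9),       # stable label, labels agree
    Match("person", "person", 0.95),  # labels agree but class is unstable
    Match("chair", "wall", 0.8),      # labels disagree
    Match("chair", "chair", 0.7),     # semi-stable, falls below threshold
]
kept = filter_matches(matches)
```

In this toy input only the wall-to-wall match survives, which mirrors the stated goal of suppressing correspondences on dynamic objects such as people in crowded scenes. A real pipeline would pass the surviving matches on to pose estimation.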

Keywords

Computer science; Sight; Visualization; Human–computer interaction; Computer vision; Remote sensing; Artificial intelligence; Geology
