In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture that aggregates feature maps at different semantic levels for image representation. Using denser feature maps, our method can produce more keypoint features and increase image retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs. We use a weakly supervised triplet ranking loss to learn discriminative features and to encourage keypoint feature repeatability for image representation. Finally, our method is computationally efficient because our architecture shares features and parameters during computation. Our method can perform accurate large-scale localization under challenging conditions while remaining within computational constraints. Extensive experimental results indicate that our method sets a new state-of-the-art on four challenging large-scale localization benchmarks and three image retrieval benchmarks.
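The weakly supervised triplet ranking objective described above can be sketched in a minimal form as follows. This is an illustrative implementation, not the paper's exact formulation: the margin value, the use of squared Euclidean distance between global descriptors, and the hinge over each negative are assumptions for the sketch.

```python
import numpy as np

def triplet_ranking_loss(query, positive, negatives, margin=0.1):
    """Illustrative weakly supervised triplet ranking loss.

    query, positive: 1-D descriptor vectors for a query image and a
    GPS-tagged positive image; negatives: list of descriptor vectors
    for GPS-tagged negatives. The hinge encourages
    d(query, positive) + margin < d(query, negative) for every negative.
    """
    d_pos = np.sum((query - positive) ** 2)  # squared Euclidean distance
    losses = [max(0.0, d_pos + margin - np.sum((query - n) ** 2))
              for n in negatives]
    return float(sum(losses))

# Toy descriptors: the positive lies close to the query, the negative far away.
q = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
loss = triplet_ranking_loss(q, p, [n], margin=0.1)
# The negative is farther than d_pos + margin, so the hinge is inactive.
```

Only image-level GPS pairs are needed to form the triplets, which is what makes the supervision weak: no pixel-level correspondence annotation enters the loss.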