Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and how to deal with degraded or missing input are less well studied. In particular, we note that previous approaches to deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of naive approaches to feature-space fusion through summation or concatenation, which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, which fuses different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results demonstrate its efficacy. The source code is available at https://github.com/Zalex97/VMLoc.
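
As a rough illustration of the Gaussian Product-of-Experts fusion described above, the sketch below combines per-modality posterior parameters into a single latent distribution by summing precisions and precision-weighting the means. The function name, tensor layout, and the optional unit-Gaussian prior expert are assumptions for illustration only, not the released VMLoc code.

```python
import torch

def poe_fuse(mus, logvars, include_prior=True, eps=1e-8):
    """Fuse per-modality Gaussian posteriors q_m(z|x_m) with a Product of Experts.

    mus, logvars: lists of tensors of shape (batch, latent_dim), one per modality
    (e.g. RGB and depth). A missing or degraded modality can simply be omitted
    from the lists, so the fused posterior falls back to the remaining experts.
    """
    if include_prior:
        # Assumed unit-Gaussian prior expert N(0, I), as is common in
        # multimodal variational models; drop it if not desired.
        mus = mus + [torch.zeros_like(mus[0])]
        logvars = logvars + [torch.zeros_like(logvars[0])]

    # The product of Gaussian experts has precision equal to the sum of the
    # individual precisions ...
    precisions = [torch.exp(-lv) for lv in logvars]
    fused_precision = sum(precisions) + eps
    fused_var = 1.0 / fused_precision
    # ... and its mean is the precision-weighted average of the expert means.
    fused_mu = fused_var * sum(m * p for m, p in zip(mus, precisions))
    return fused_mu, torch.log(fused_var)


# Example usage with two hypothetical modality encoders' outputs (batch 4, dim 16):
mu_rgb, logvar_rgb = torch.randn(4, 16), torch.zeros(4, 16)
mu_depth, logvar_depth = torch.randn(4, 16), torch.zeros(4, 16)
mu_z, logvar_z = poe_fuse([mu_rgb, mu_depth], [logvar_rgb, logvar_depth])
```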