Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
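To make the guidance mechanism concrete, the sketch below shows one way such an epipolar attention loss could be written in PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the fundamental matrix between the two training views is available, that cross-attention rows are softmax-normalised, and that token pixel coordinates are known; the function name, `margin` threshold, and tensor layout are all hypothetical.

```python
import torch

def epipolar_attention_loss(attn, F, query_xy, key_xy, margin=2.0):
    """Hypothetical sketch of an epipolar guidance loss on cross-attention maps.

    attn:     (B, Q, K) cross-attention weights from image-1 query tokens to image-2 key tokens
    F:        (B, 3, 3) fundamental matrices mapping image-1 points to epipolar lines in image 2
    query_xy: (B, Q, 2) pixel coordinates of query tokens in image 1
    key_xy:   (B, K, 2) pixel coordinates of key tokens in image 2
    margin:   distance in pixels within which a key is treated as lying on the epipolar line
    """
    B, Q, K = attn.shape
    device = attn.device

    # Homogeneous coordinates of query points, and their epipolar lines l = F x in image 2
    q_h = torch.cat([query_xy, torch.ones(B, Q, 1, device=device)], dim=-1)   # (B, Q, 3)
    lines = torch.einsum('bij,bqj->bqi', F, q_h)                              # (B, Q, 3): a*x + b*y + c = 0

    # Point-to-line distance from every key location to every query's epipolar line
    k_h = torch.cat([key_xy, torch.ones(B, K, 1, device=device)], dim=-1)     # (B, K, 3)
    num = torch.abs(torch.einsum('bqi,bki->bqk', lines, k_h))                 # (B, Q, K)
    den = lines[..., :2].norm(dim=-1, keepdim=True) + 1e-8                    # (B, Q, 1)
    dist = num / den                                                          # (B, Q, K)

    # Keys within `margin` pixels of the line are the geometrically plausible matches
    on_line = (dist < margin).float()

    # Since each attention row sums to 1, rewarding mass on the line
    # simultaneously penalizes mass off it
    mass_on_line = (attn * on_line).sum(dim=-1)                               # (B, Q)
    return (1.0 - mass_on_line).mean()
```

Note that, consistent with the claim above, a loss of this form is only applied during training, so no camera pose or fundamental matrix is required at test-time.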