Transformers are powerful visual learners, in large part because of their conspicuous lack of manually specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, owing to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility) and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach: guiding visual Transformers to learn multiple-view geometry while allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along them, since they contain geometrically plausible matches. Unlike previous methods, our proposal requires no camera pose information at test time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches on object retrieval, without needing pose information at test time.
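To make the mechanism concrete, below is a minimal PyTorch sketch of one way an epipolar guidance loss on cross-attention maps could look. This is an illustration under stated assumptions, not the paper's implementation: the function name `epipolar_attention_loss`, the distance threshold `thresh`, and the exact loss form are hypothetical, and the fundamental matrix `F` is assumed to be available (e.g. from known relative camera poses) at training time only, consistent with the abstract's claim that no pose information is needed at test time.

```python
# Hypothetical sketch: guiding cross-attention with epipolar lines.
# Assumes `attn` is a softmax-normalized cross-attention map of shape
# (Hq*Wq, Hk*Wk): for each query-image location, a distribution over
# key-image locations. `F` maps query points to epipolar lines (l = F x).
import torch

def epipolar_attention_loss(attn, F, hw_q, hw_k, thresh=1.0):
    """Penalize attention mass far from the epipolar lines and
    encourage mass along them (one plausible loss form, not the
    authors' exact formulation)."""
    Hq, Wq = hw_q
    Hk, Wk = hw_k

    # Homogeneous pixel coordinates of every query location: (Nq, 3)
    yq, xq = torch.meshgrid(torch.arange(Hq, dtype=torch.float32),
                            torch.arange(Wq, dtype=torch.float32),
                            indexing="ij")
    pts_q = torch.stack([xq.flatten(), yq.flatten(),
                         torch.ones(Hq * Wq)], dim=-1)

    # Epipolar line l = (a, b, c) in the key image for each query point
    lines = pts_q @ F.T                      # (Nq, 3)

    # Homogeneous pixel coordinates of every key location: (Nk, 3)
    yk, xk = torch.meshgrid(torch.arange(Hk, dtype=torch.float32),
                            torch.arange(Wk, dtype=torch.float32),
                            indexing="ij")
    pts_k = torch.stack([xk.flatten(), yk.flatten(),
                         torch.ones(Hk * Wk)], dim=-1)

    # Point-to-line distance |ax + by + c| / sqrt(a^2 + b^2): (Nq, Nk)
    num = (lines @ pts_k.T).abs()
    denom = lines[:, :2].norm(dim=-1, keepdim=True).clamp(min=1e-8)
    dist = num / denom

    # Geometrically plausible matches lie within `thresh` pixels of the line
    on_line = (dist < thresh).float()

    # Penalize off-line attention mass, reward on-line mass
    off_mass = (attn * (1.0 - on_line)).sum(dim=-1)
    on_mass = (attn * on_line).sum(dim=-1)
    return (off_mass - on_mass).mean()

# Toy usage on a 32x32 feature grid; F would come from relative camera
# poses available during training, never at test time.
attn = torch.softmax(torch.randn(32 * 32, 32 * 32), dim=-1)
F = torch.randn(3, 3)
loss = epipolar_attention_loss(attn, F, (32, 32), (32, 32))
```

Because the attention rows are normalized, penalizing off-line mass while rewarding on-line mass simply pushes each query's attention distribution toward its epipolar line; the guidance acts as a soft training signal rather than a hard constraint, so the network can still deviate from it when the data demands.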