Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it to learn viewpoint-agnostic representations. The key elements of 3DTRL are a pseudo-depth estimator and a learned camera matrix, which together impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL can be easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL on several vision tasks, including image classification, multi-view video alignment, and action recognition. Models with 3DTRL outperform their backbone Transformers on all of these tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
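To make the geometric idea concrete, the following is a minimal sketch of how 2D patch tokens could be lifted to 3D positions given a per-token depth and a camera transform. It is an illustration under stated assumptions, not the 3DTRL implementation: the abstract does not specify the camera model, so a pinhole model with a rigid transform (rotation `R`, translation `t`) is assumed here, the depths and camera parameters are passed in as plain numbers rather than predicted by learned estimators, and all function names (`patch_centers`, `back_project`, `to_world`, `token_positions_3d`) are hypothetical.

```python
import math


def patch_centers(grid_h, grid_w):
    """Normalized (u, v) centers of the patches in [-1, 1], in row-major order."""
    return [
        (2.0 * (col + 0.5) / grid_w - 1.0, 2.0 * (row + 0.5) / grid_h - 1.0)
        for row in range(grid_h)
        for col in range(grid_w)
    ]


def back_project(uv, depth, focal=1.0):
    """Assumed pinhole back-projection: (u, v, depth) -> camera-space (x, y, z)."""
    u, v = uv
    return (u * depth / focal, v * depth / focal, depth)


def rotation_y(angle):
    """3x3 rotation matrix about the y axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def to_world(point, R, t):
    """Rigid transform of a 3D point: world = R @ point + t."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )


def token_positions_3d(grid_h, grid_w, depths, R, t, focal=1.0):
    """One 3D position per patch token.

    `depths`, `R`, and `t` stand in for the outputs of the pseudo-depth
    estimator and the learned camera matrix; here they are fixed inputs
    (an assumption made purely for illustration).
    """
    centers = patch_centers(grid_h, grid_w)
    if len(depths) != len(centers):
        raise ValueError("expected one depth value per patch token")
    return [
        to_world(back_project(uv, d, focal), R, t)
        for uv, d in zip(centers, depths)
    ]


# Example: a 2x2 token grid at constant depth, seen from two camera poses.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frontal = token_positions_3d(2, 2, [2.0] * 4, identity, (0.0, 0.0, 0.0))
rotated = token_positions_3d(2, 2, [2.0] * 4, rotation_y(math.pi / 2), (0.0, 0.0, 0.0))
```

In a Transformer, such per-token 3D coordinates would presumably be embedded and combined with the token features; how 3DTRL does this is not stated in the abstract, so that step is left out of the sketch.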