Evidence from cognitive psychology suggests that understanding spatio-temporal object interactions and dynamics can be essential for recognizing actions in complex videos. Therefore, action recognition models are expected to benefit from explicit modeling of objects, including their appearance, interaction, and dynamics. Recently, video transformers have shown great success in video understanding, exceeding CNN performance. Yet, existing video transformer models do not explicitly model objects. In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric spatio-temporal representations throughout multiple transformer layers. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an ``Object-Region Attention'' element applies self-attention over the patches and \emph{object regions}. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate ``Object-Dynamics Module'', which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on standard and compositional action recognition on Something-Something V2, standard action recognition on Epic-Kitchens100 and Diving48, and spatio-temporal action detection on AVA. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/.
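As a rough illustration of the appearance stream described in the abstract, the sketch below lets patch tokens attend over both patch tokens and pooled object-region tokens. It is not the ORViT implementation. The function names (`pool_object_token`, `object_region_attention`), the single-frame patch grid, the mean-pooling of patches inside a box, the use of patches as the only queries, and the omission of learned query/key/value projections are all simplifying assumptions made here for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_object_token(patch_grid, box):
    """Mean-pool the patch tokens inside box = (row0, col0, row1, col1).

    The box is given in patch-grid coordinates with exclusive end indices.
    Mean-pooling is an assumed stand-in for a region-pooling operator.
    """
    r0, c0, r1, c1 = box
    cells = [patch_grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(v[i] for v in cells) / len(cells) for i in range(dim)]


def object_region_attention(patch_grid, boxes):
    """Attend from every patch token to all patch tokens plus object tokens.

    Assumption: queries are the patch tokens; keys and values are the
    concatenation of patch tokens and pooled object tokens, with identity
    projections instead of learned ones.
    """
    patches = [tok for row in patch_grid for tok in row]
    objects = [pool_object_token(patch_grid, b) for b in boxes]
    keys = patches + objects
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)

    enriched = []
    for q in patches:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        enriched.append(
            [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
        )
    return enriched


# Toy usage: a 2x2 grid of 2-dimensional patch tokens and one object box
# covering the top row of the grid.
grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
enriched_patches = object_region_attention(grid, boxes=[(0, 0, 1, 2)])
```

The output has one enriched vector per input patch, so under these assumptions the block can be dropped between transformer layers without changing the token count. Extending the sketch to video would mean flattening patches across frames and pooling one object token per object per frame.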