Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and \emph{object regions}. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and EPIC-Kitchens-100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at \url{https://roeiherz.github.io/ORViT/}.
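To make the appearance stream more concrete, the following is a minimal single-frame, single-head sketch of attention over patch tokens and object-region tokens. It is an illustration under stated assumptions, not the released ORViT implementation: object tokens are obtained here by average-pooling patch features inside each box (a simple stand-in for a region-pooling operator), boxes are given in patch-grid units, and the names `pool_object_tokens`, `object_region_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. The temporal dimension and the dynamics stream are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_object_tokens(patches, boxes):
    # patches: (H, W, d) grid of patch features.
    # boxes: list of (y0, x0, y1, x1) in patch units, half-open ranges.
    # Assumption: mean pooling inside the box stands in for region pooling.
    return np.stack(
        [patches[y0:y1, x0:x1].mean(axis=(0, 1)) for y0, x0, y1, x1 in boxes]
    )


def object_region_attention(patches, boxes, w_q, w_k, w_v):
    h, w, d = patches.shape
    patch_tokens = patches.reshape(h * w, d)
    object_tokens = pool_object_tokens(patches, boxes)
    # Joint token set: uniform patch tokens followed by object-region tokens.
    tokens = np.concatenate([patch_tokens, object_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    out = attn @ v
    # Keep only the patch positions, now enriched with object context.
    return out[: h * w].reshape(h, w, -1), attn


rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
patches = rng.normal(size=(H, W, D))
boxes = [(0, 0, 2, 2), (1, 2, 4, 4)]  # two hypothetical object boxes
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
enriched, attn = object_region_attention(patches, boxes, w_q, w_k, w_v)
```

In this sketch every patch token can attend to the two object tokens as well as to all other patches, which is one simple way that object information could flow back into the patch grid.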