Human-object interaction is one of the most important visual cues that has not yet been explored for egocentric action anticipation. We propose a novel Transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses these changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual-transformer-based methods, including those that use object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), outperforming the second-best model by 3.3% in mean top-5 recall.
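As a rough illustration of the hand-object cross-attention idea sketched above (not the authors' implementation), the following PyTorch snippet lets hand tokens attend to object tokens and returns refined hand tokens; all module names, shapes, and hyperparameters are assumptions for illustration only.

```python
# Minimal sketch of Spatial Cross-Attention (SCA) between hand and object tokens.
# This is an assumed illustration, not the InAViT implementation.
import torch
import torch.nn as nn


class SpatialCrossAttentionSketch(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Hand tokens act as queries; object tokens act as keys and values.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, hand_tokens: torch.Tensor, object_tokens: torch.Tensor) -> torch.Tensor:
        # hand_tokens:   (B, H, D) -- one token per detected hand
        # object_tokens: (B, O, D) -- one token per detected object
        refined, _ = self.cross_attn(hand_tokens, object_tokens, object_tokens)
        # Residual connection and normalization, as in standard Transformer blocks.
        return self.norm(hand_tokens + refined)


if __name__ == "__main__":
    sca = SpatialCrossAttentionSketch()
    hands = torch.randn(2, 2, 768)    # 2 clips, 2 hand tokens each
    objects = torch.randn(2, 4, 768)  # 4 object tokens each
    print(sca(hands, objects).shape)  # torch.Size([2, 2, 768])
```

In this sketch, the refined hand tokens could then be combined with contextual (environment) tokens, analogous in spirit to the Trajectory Cross-Attention step described in the abstract.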