Modeling the relations between actors and scene context advances video action detection, where correlations among multiple actors make action recognition challenging. Existing studies model the relation between each actor and the scene to improve action recognition. However, scene variations and background interference limit the effectiveness of this relation modeling. In this paper, we propose to select actor-related scene context, rather than directly leveraging the raw video scene, to improve relation modeling. We develop a Cycle Actor-Context Relation network (CycleACR), in which a symmetric graph models the actor and context relations bidirectionally. CycleACR consists of the Actor-to-Context Reorganization (A2C-R), which collects actor features to reorganize context features, and the Context-to-Actor Enhancement (C2A-E), which dynamically uses the reorganized context features to enhance actor features. Compared to existing designs that focus on C2A-E, CycleACR introduces A2C-R for more effective relation modeling. This modeling enables CycleACR to achieve state-of-the-art performance on two popular action detection datasets, AVA and UCF101-24. We also provide ablation studies and visualizations to show how our cycle actor-context relation modeling improves video action detection. Code is available at https://github.com/MCG-NJU/CycleACR.
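The two-stage cycle described above (A2C-R followed by C2A-E) can be illustrated with a minimal sketch. This is not the released CycleACR implementation: it assumes plain scaled dot-product attention with residual addition for both stages, uses keys as values, and omits all learned projections. The function names `attend`, `a2c_reorganize`, `c2a_enhance`, and `cycle_acr` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys):
    """For each query, return a softmax-weighted average of the keys.

    Assumption: keys double as values and there are no learned projections.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * k[d] for w, k in zip(weights, keys))
                    for d in range(dim)])
    return out


def add(xs, ys):
    """Element-wise residual addition of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def a2c_reorganize(actors, context):
    """A2C-R (sketch): each context location collects actor features."""
    return add(context, attend(context, actors))


def c2a_enhance(actors, reorganized_context):
    """C2A-E (sketch): each actor attends over the reorganized context."""
    return add(actors, attend(actors, reorganized_context))


def cycle_acr(actors, context):
    """Run the actor -> context -> actor cycle once."""
    reorganized = a2c_reorganize(actors, context)
    return c2a_enhance(actors, reorganized)


if __name__ == "__main__":
    actors = [[1.0, 0.0], [0.0, 1.0]]               # 2 actors, feature dim 2
    context = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]  # 3 context locations
    for enhanced in cycle_acr(actors, context):
        print(enhanced)
```

The point of the sketch is the ordering: context features are first reorganized using actor features, and only then used to enhance the actors, instead of the actors attending directly to raw context.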