Temporal action proposal generation is an essential and challenging task that aims to localize temporal intervals containing human actions in untrimmed videos. Most existing approaches cannot follow the human cognitive process of understanding video context because they lack an attention mechanism to express the concept of an action, the agent who performs it, or the interaction between the agent and the environment. Based on the definition of an action as a human, known as an agent, interacting with the environment and performing an action that affects it, we propose a contextual Agent-Environment Network (AEN). The proposed contextual AEN involves (i) an agent pathway, operating at a local level to identify which humans/agents are acting, and (ii) an environment pathway, operating at a global level to capture how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method consistently outperforms state-of-the-art methods regardless of the backbone network employed.