Humans typically perceive an action in a video through the interaction between an actor and the surrounding environment. An action begins only when the main actor in the video starts to interact with the environment, and it ends when the main actor stops the interaction. Despite great progress in temporal action proposal generation, most existing works ignore this fact and leave their models to learn action proposal as a black box. In this paper, we attempt to simulate this human ability by proposing an Actor Environment Interaction (AEI) network to improve the video representation for temporal action proposal generation. AEI contains two modules, i.e., a perception-based visual representation (PVR) module and a boundary-matching module (BMM). PVR represents each video snippet by taking human-human relations and human-environment relations into consideration through the proposed adaptive attention mechanism. The resulting video representation is then passed to BMM to generate action proposals. AEI is comprehensively evaluated on the ActivityNet-1.3 and THUMOS-14 datasets, on both temporal action proposal and detection tasks, with two boundary-matching architectures (i.e., CNN-based and GCN-based) and two classifiers (i.e., Unet and P-GCN). Our AEI robustly outperforms state-of-the-art methods with remarkable performance and generalization on both temporal action proposal generation and temporal action detection.