We propose TubeR: a simple solution for spatio-temporal video action detection. Unlike existing methods, which depend on either an offline actor detector or hand-designed actor-positional hypotheses such as proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively increases model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head that utilizes short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and maintains good results even for long video clips. TubeR outperforms the previous state-of-the-art on the commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21.
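To make the query-based design concrete, the sketch below illustrates the general idea of tubelet queries and tubelet attention with plain NumPy. All names, shapes, and the two output heads here are illustrative assumptions, not the paper's actual implementation: a set of learned queries per frame is refined by self-attention across all frames and queries, then projected to per-frame boxes, a per-tubelet class score, and a per-frame "action switch" score.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tubelet_attention(queries):
    """Toy self-attention over tubelet queries (hypothetical sketch).

    queries: (T, N, D) array — N tubelet queries for each of T frames.
    Attention is computed jointly over all T*N queries, so each query
    can aggregate information across both space (queries) and time (frames).
    """
    T, N, D = queries.shape
    q = queries.reshape(T * N, D)
    attn = softmax(q @ q.T / np.sqrt(D), axis=-1)   # (T*N, T*N)
    return (attn @ q).reshape(T, N, D)

rng = np.random.default_rng(0)
T, N, D, C = 8, 4, 32, 10            # frames, queries, feature dim, classes
queries = rng.normal(size=(T, N, D)) # stand-in for learned tubelet queries
feats = tubelet_attention(queries)

# Hypothetical output heads (random weights, for shape illustration only):
W_box = rng.normal(size=(D, 4))
W_cls = rng.normal(size=(D, C))
w_switch = rng.normal(size=D)

boxes = feats @ W_box                         # (T, N, 4): one box per query per frame
cls_logits = feats.mean(axis=0) @ W_cls       # (N, C): one class score per tubelet
switch = 1.0 / (1.0 + np.exp(-(feats @ w_switch)))  # (T, N) in (0, 1): is the action active?
```

The per-frame switch score is what allows a tubelet to have a variable temporal extent: frames whose score falls below a threshold would be trimmed from the tubelet, rather than the tubelet being forced to span the whole clip.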