Action detection is an essential and challenging task, especially on densely labelled datasets of untrimmed videos. Temporal relations in these datasets are complex and include challenges such as composite actions and co-occurring actions. Detecting actions in such videos requires efficiently capturing both short-term and long-term temporal information. To this end, we propose a novel ConvTransformer network for action detection. The network comprises three main components: (1) a Temporal Encoder module that explores global and local temporal relations at multiple temporal resolutions; (2) a Temporal Scale Mixer module that fuses the multi-scale features into a unified feature representation; and (3) a Classification module that learns position relative to the center of each action instance and predicts frame-level classification scores. Extensive experiments on three datasets, Charades, TSU and MultiTHUMOS, confirm the effectiveness of the proposed method: our network outperforms state-of-the-art methods on all three.
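To make the multi-scale idea concrete, the following is a minimal pure-Python sketch of the data flow only, not the proposed network. It assumes that plain average pooling can stand in for the learned Temporal Encoder, that nearest-neighbour upsampling plus summation can stand in for the Temporal Scale Mixer, and that a linear layer with an independent sigmoid per class can stand in for the frame-level classifier. All function names (`avg_pool_time`, `build_pyramid`, `upsample_nearest`, `fuse_scales`, `frame_scores`) are hypothetical and chosen for illustration.

```python
import math


def avg_pool_time(frames, stride=2):
    """Reduce temporal resolution by averaging non-overlapping windows of frames."""
    pooled = []
    for i in range(0, len(frames), stride):
        window = frames[i:i + stride]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def build_pyramid(frames, num_scales=3):
    """Assumed stand-in for a multi-resolution encoder: repeated temporal pooling."""
    pyramid = [frames]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool_time(pyramid[-1]))
    return pyramid


def upsample_nearest(frames, target_len):
    """Stretch a coarse sequence back to target_len by repeating frames."""
    src_len = len(frames)
    return [frames[min(t * src_len // target_len, src_len - 1)]
            for t in range(target_len)]


def fuse_scales(pyramid):
    """Assumed stand-in for a scale mixer: upsample every level and sum them."""
    target_len = len(pyramid[0])
    dim = len(pyramid[0][0])
    fused = [[0.0] * dim for _ in range(target_len)]
    for level in pyramid:
        upsampled = upsample_nearest(level, target_len)
        for t in range(target_len):
            for d in range(dim):
                fused[t][d] += upsampled[t][d]
    return fused


def frame_scores(fused, weights, bias):
    """Per-frame, per-class sigmoid scores; classes are scored independently,
    so co-occurring actions can be active in the same frame."""
    scores = []
    for feat in fused:
        row = []
        for w, b in zip(weights, bias):
            logit = sum(wi * xi for wi, xi in zip(w, feat)) + b
            row.append(1.0 / (1.0 + math.exp(-logit)))
        scores.append(row)
    return scores


# Toy input: 8 frames with 2-dimensional features.
frames = [[float(t), 1.0] for t in range(8)]
pyramid = build_pyramid(frames, num_scales=3)
fused = fuse_scales(pyramid)
scores = frame_scores(fused, weights=[[0.0, 0.0], [0.1, -0.1]], bias=[0.0, 0.0])
```

The fused sequence keeps the original temporal length, so each frame carries both fine-grained (short-term) and coarse (long-term) context before scoring. A real model would replace the pooling and summation with learned layers.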