Temporal action detection aims to predict the time intervals and the classes of action instances in a video. Despite their promising performance, existing two-stream models suffer from slow inference due to their reliance on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework that builds a strong RGB-based detector by transferring knowledge from the motion modality. Specifically, instead of direct distillation, we propose to separately learn RGB and motion representations, which are in turn combined to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while keeping the RGB information intact. In addition, we introduce a local attentive fusion module to better exploit the multimodal complementarity; it is designed to preserve the local discriminability of the features, which is important for action localization. Extensive experiments on standard benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.
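To make the decomposed design concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: a dual branch that learns RGB and motion representations from RGB input with asymmetric objectives (distillation applied only to the motion branch), and a local attentive fusion that combines them. All module names, dimensions, the window-attention formulation, and the MSE distillation loss are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: dual-branch decomposition + local attentive fusion (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalAttentiveFusion(nn.Module):
    """Fuse RGB and motion features with attention restricted to a local
    temporal window, so local discriminability is preserved (assumed form)."""

    def __init__(self, dim, window=9):
        super().__init__()
        self.window = window
        self.to_q = nn.Conv1d(dim, dim, 1)
        self.to_k = nn.Conv1d(dim, dim, 1)
        self.to_v = nn.Conv1d(dim, dim, 1)
        self.proj = nn.Conv1d(2 * dim, dim, 1)

    def forward(self, rgb, mot):
        # rgb, mot: (B, C, T) snippet-level features from the two branches.
        q, k, v = self.to_q(rgb), self.to_k(mot), self.to_v(mot)
        B, C, T = q.shape
        pad = self.window // 2
        k = F.pad(k, (pad, pad)).unfold(-1, self.window, 1)    # (B, C, T, W)
        v = F.pad(v, (pad, pad)).unfold(-1, self.window, 1)    # (B, C, T, W)
        attn = torch.einsum('bct,bctw->btw', q, k) / C ** 0.5  # local scores
        attn = attn.softmax(dim=-1)
        mot_ctx = torch.einsum('btw,bctw->bct', attn, v)       # attended motion
        return self.proj(torch.cat([rgb, mot_ctx], dim=1))


class DecomposedDetectorNeck(nn.Module):
    """Two branches over RGB features: one keeps RGB semantics, the other is
    trained to mimic a frozen optical-flow teacher (asymmetric objectives)."""

    def __init__(self, dim=256):
        super().__init__()
        self.rgb_branch = nn.Conv1d(dim, dim, 3, padding=1)
        self.motion_branch = nn.Conv1d(dim, dim, 3, padding=1)
        self.fusion = LocalAttentiveFusion(dim)

    def forward(self, rgb_feat, flow_teacher_feat=None):
        f_rgb = self.rgb_branch(rgb_feat)
        f_mot = self.motion_branch(rgb_feat)
        fused = self.fusion(f_rgb, f_mot)  # passed on to any detection head
        losses = {}
        if self.training and flow_teacher_feat is not None:
            # Distillation supervises only the motion branch, so the RGB
            # branch's appearance information is left intact.
            losses['distill'] = F.mse_loss(f_mot, flow_teacher_feat)
        return fused, losses
```

In this sketch the distillation loss never touches the RGB branch, which is one way to realize the "asymmetric training objectives" mentioned above; the fused feature is backbone- and head-agnostic in the sense that it can be fed to any temporal detection head expecting (B, C, T) features.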