Efficiently modeling spatial-temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales, thus struggling with events of various scales. On the other hand, the dense interaction modeling paradigm achieves only sub-optimal performance because action-irrelevant parts introduce additional noise into the final prediction. In this paper, we propose a unified action recognition framework to investigate the dynamic nature of video content with the following designs. First, when extracting local cues, we generate spatial-temporal kernels of dynamic scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions only among a few selected foreground objects via a Transformer, which yields a sparse paradigm. We call the proposed framework the Event Adaptive Network (EAN) because both key designs are adaptive to the input video content. To exploit the short-term motions within local segments, we propose a novel and efficient Latent Motion Code (LMC) module, further improving the performance of the framework. Extensive experiments on several large-scale video datasets, e.g., Something-Something V1 & V2, Kinetics, and Diving48, verify that our models achieve state-of-the-art or competitive performance at low FLOPs. Code is available at: https://github.com/tianyuan168326/EAN-Pytorch.
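To make the first design concrete, below is a minimal PyTorch sketch of the dynamic-scale idea: several spatial-temporal convolution branches of different kernel sizes are mixed with input-conditioned weights, so the effective receptive field adapts per clip. This is an illustrative assumption, not the authors' implementation; the class name `DynamicScaleConv` and its gating layout are hypothetical, and the official repository should be consulted for the actual module.

```python
import torch
import torch.nn as nn


class DynamicScaleConv(nn.Module):
    """Hypothetical sketch: mixes 3D conv branches of different
    spatial-temporal kernel scales with weights predicted from the
    input, so the effective scale adapts to each video clip."""

    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        # One 3D conv branch per candidate spatial-temporal scale.
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=k, padding=k // 2)
            for k in scales
        )
        # Predict one mixing weight per branch from globally pooled features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(channels, len(scales)),
            nn.Softmax(dim=1),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        w = self.gate(x)  # (N, num_scales)
        # Stack branch outputs: (N, num_scales, C, T, H, W)
        outs = torch.stack([b(x) for b in self.branches], dim=1)
        # Weighted sum over branches -> (N, C, T, H, W)
        return (w[:, :, None, None, None, None] * outs).sum(dim=1)


if __name__ == "__main__":
    clip = torch.randn(2, 16, 8, 14, 14)  # two clips: C=16, T=8, H=W=14
    print(DynamicScaleConv(16)(clip).shape)  # torch.Size([2, 16, 8, 14, 14])
```

The same gating pattern would extend naturally to the sparse Transformer stage, where only a few foreground tokens (rather than all spatial locations) are selected as attention inputs.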