Video data exhibit complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities. To effectively capture these diverse motion patterns, this paper presents a new temporal adaptive module ({\bf TAM}) that generates video-specific temporal kernels from the video's own feature maps. TAM adopts a unique two-level adaptive modeling scheme that decouples the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned within a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a modular block that can be integrated into 2D CNNs to yield a powerful video architecture (TANet) at a very small extra computational cost. Extensive experiments on the Kinetics-400 and Something-Something datasets demonstrate that our TAM consistently outperforms other temporal modeling methods and achieves state-of-the-art performance at a similar complexity. The code is available at \url{https://github.com/liu-zhy/temporal-adaptive-module}.
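To make the two-level scheme concrete, the following is a minimal PyTorch-style sketch of a TAM-like block, in which a local branch predicts a location-sensitive importance map and a global branch predicts a location-invariant, per-channel aggregation kernel. The class name \texttt{TAMSketch}, the branch layouts, and the hyperparameters (\texttt{kernel\_size}, \texttt{reduction}, the two-fold expansion in the global branch) are illustrative assumptions rather than the released implementation; the authors' code is available at the repository linked above.

\begin{verbatim}
# Illustrative sketch of a two-level temporal adaptive block
# (assumptions for exposition; not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAMSketch(nn.Module):
    """Local branch:  temporal convs -> location-sensitive importance
                      map (short-term information).
       Global branch: FC layers -> location-invariant, per-channel
                      aggregation kernel (long-term structure)."""

    def __init__(self, channels, num_frames, kernel_size=3, reduction=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Local branch operates on spatially pooled features (N, C, T).
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size,
                      padding=kernel_size // 2),
            nn.BatchNorm1d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Global branch maps the temporal axis (length T) to a size-K
        # kernel for every (video, channel) pair; softmax keeps it a
        # normalized weighting kernel.
        self.global_branch = nn.Sequential(
            nn.Linear(num_frames, num_frames * 2),
            nn.ReLU(inplace=True),
            nn.Linear(num_frames * 2, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (N, C, T, H, W) video feature map.
        n, c, t, h, w = x.shape
        pooled = x.mean(dim=[3, 4])                      # (N, C, T)

        # 1) Location-sensitive importance map, broadcast over space.
        importance = self.local(pooled).view(n, c, t, 1, 1)
        x = x * importance

        # 2) Location-invariant adaptive kernel, applied as a depthwise
        #    temporal convolution (one kernel per video and channel).
        kernel = self.global_branch(pooled).reshape(
            n * c, 1, self.kernel_size, 1)
        x = x.reshape(1, n * c, t, h * w)
        x = F.conv2d(x, kernel, padding=(self.kernel_size // 2, 0),
                     groups=n * c)
        return x.reshape(n, c, t, h, w)

# Example usage: 8-frame clips with 64-channel 14x14 feature maps.
block = TAMSketch(channels=64, num_frames=8)
out = block(torch.randn(2, 64, 8, 14, 14))  # shape (2, 64, 8, 14, 14)
\end{verbatim}

In this sketch, the grouped \texttt{conv2d} trick applies a different temporal kernel to every video and channel while sharing it across all spatial locations, which mirrors the location-invariant property of the aggregation weight described above.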