Video data exhibit complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities. To effectively capture these diverse motion patterns, this paper presents a new temporal adaptive module (TAM) that generates video-specific temporal kernels based on the video's own feature maps. TAM adopts a unique two-level adaptive modeling scheme that decouples the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned within a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a principled module that can be integrated into 2D CNNs to yield a powerful video architecture (TANet) at very small extra computational cost. Extensive experiments on the Kinetics-400 and Something-Something datasets demonstrate that TAM consistently outperforms other temporal modeling methods and achieves state-of-the-art performance at similar complexity.
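To make the two-level scheme concrete, below is a minimal PyTorch sketch of the TAM idea as described in the abstract. It is an illustrative reconstruction, not the authors' reference implementation: the tensor layout, the reduction ratio `beta`, the kernel size `K`, and the exact layers in each branch are assumptions for demonstration.

```python
# Illustrative sketch of a temporal adaptive module (TAM), assuming an
# (N, C, T, H, W) clip layout; hyperparameters `beta` and `kernel_size`
# are hypothetical choices, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAM(nn.Module):
    def __init__(self, channels, n_frames, kernel_size=3, beta=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Local branch: temporal convolutions over spatially pooled features
        # produce a location-sensitive importance map (short-term information).
        self.local_branch = nn.Sequential(
            nn.Conv1d(channels, channels // beta, kernel_size,
                      padding=kernel_size // 2),
            nn.BatchNorm1d(channels // beta),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // beta, channels, 1),
            nn.Sigmoid(),
        )
        # Global branch: fully connected layers see the whole clip and emit a
        # location-invariant aggregation kernel (long-term structure).
        self.global_branch = nn.Sequential(
            nn.Linear(n_frames, n_frames * beta),
            nn.ReLU(inplace=True),
            nn.Linear(n_frames * beta, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (N, C, T, H, W) -- a clip of T frames with C channels.
        n, c, t, h, w = x.shape
        pooled = x.mean(dim=(3, 4))  # spatial global average pooling -> (N, C, T)
        # Level 1: location-sensitive reweighting by the importance map.
        importance = self.local_branch(pooled)          # (N, C, T)
        x = x * importance.view(n, c, t, 1, 1)
        # Level 2: a video-specific temporal kernel per (sample, channel) pair.
        kernel = self.global_branch(pooled.reshape(n * c, t))  # (N*C, K)
        kernel = kernel.view(n * c, 1, self.kernel_size, 1, 1)
        # Depthwise temporal aggregation: grouped conv applies each adaptive
        # kernel only to its own (sample, channel) slice.
        x = x.reshape(1, n * c, t, h, w)
        x = F.conv3d(x, kernel,
                     padding=(self.kernel_size // 2, 0, 0), groups=n * c)
        return x.view(n, c, t, h, w)
```

In this sketch, the module preserves the input shape, so it could in principle be dropped after a 2D convolution block in a frame-level CNN backbone, matching the abstract's claim that TAM integrates into 2D CNNs to form a video architecture.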