Temporal modeling is crucial for capturing the spatiotemporal structure of videos in action recognition. Video data exhibits extremely complex dynamics along the temporal dimension due to factors such as camera motion, speed variation, and diverse activities. To capture these diverse motion patterns effectively, this paper presents a new temporal adaptive module (TAM) that generates video-specific kernels from the video's own feature maps. TAM adopts a unique two-level adaptive modeling scheme by decoupling the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned within a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term temporal structure. TAM is a principled module that can be integrated into 2D CNNs to yield a powerful video architecture (TANet) at a very small extra computational cost. Extensive experiments on Kinetics-400 demonstrate that TAM consistently outperforms other temporal modeling methods owing to its adaptive modeling strategy. On the Something-Something datasets, TANet achieves superior performance compared with previous state-of-the-art methods. The code will be made available soon at https://github.com/liu-zhy/TANet.
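To make the two-level scheme concrete, below is a minimal PyTorch sketch of a TAM-style block written from the description above, not taken from the official repository: the class name TAM, the constructor arguments (n_frames, kernel_size, reduction), and the branch widths are illustrative assumptions. The local branch produces a location-sensitive importance map from channel-wise temporal descriptors, and the global branch generates a kernel of size K per channel that is applied as a depthwise temporal convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TAM(nn.Module):
    """Illustrative sketch of a temporal adaptive module (not the official code).

    Input x: (N, C, T, H, W).
    Local branch  -> location-sensitive importance map over T (short-term).
    Global branch -> location-invariant aggregation kernel of size K (long-term),
                     applied as a per-channel (depthwise) temporal convolution.
    """

    def __init__(self, channels, n_frames, kernel_size=3, reduction=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Local branch: temporal convs on channel descriptors -> importance map in (0, 1).
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size=3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=3, padding=1, bias=False),
            nn.Sigmoid(),
        )
        # Global branch: MLP over the whole temporal axis -> K aggregation weights per channel.
        self.glob = nn.Sequential(
            nn.Linear(n_frames, n_frames * 2, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(n_frames * 2, kernel_size, bias=False),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        n, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))                    # (N, C, T) channel-wise temporal descriptor

        # Short-term: modulate every temporal location with a learned importance map.
        importance = self.local(desc).view(n, c, t, 1, 1)
        x = x * importance

        # Long-term: one adaptive kernel per (sample, channel), shared over all locations.
        kernel = self.glob(desc.reshape(n * c, t))   # (N*C, K), softmax-normalized
        weight = kernel.view(n * c, 1, self.kernel_size, 1)

        # Depthwise temporal convolution with the video-specific kernels.
        x = x.reshape(1, n * c, t, h * w)
        x = F.conv2d(x, weight, groups=n * c, padding=(self.kernel_size // 2, 0))
        return x.view(n, c, t, h, w)


# Toy usage: an 8-frame clip with 64 channels keeps its shape after TAM.
tam = TAM(channels=64, n_frames=8)
clip = torch.randn(2, 64, 8, 14, 14)
out = tam(clip)  # (2, 64, 8, 14, 14)
```

The split mirrors the abstract: the sigmoid importance map differs per temporal location (short-term sensitivity), while the softmax-normalized kernel is shared across all locations of a channel (long-term aggregation), keeping the extra cost small relative to the backbone 2D CNN.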