Temporal modelling is the key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly reduce computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with the current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something-Something V1 & V2, Jester and Mini-Kinetics show that our approach achieves about 40% computation savings with accuracy comparable to state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AdaFuse/
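To make the fusion mechanism concrete, the sketch below shows one plausible reading of per-channel adaptive temporal fusion in PyTorch: a lightweight policy chooses, for each channel, whether to keep the current feature, reuse the past feature, or skip the channel entirely. This is a minimal illustration under assumed design choices (the class name `AdaptiveTemporalFusion`, the pooled-feature policy head, and the Gumbel-softmax relaxation are assumptions for exposition), not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalFusion(nn.Module):
    """Illustrative sketch of per-channel adaptive temporal fusion.

    For each channel, a small policy picks one of three actions:
    keep the current feature, reuse the previous frame's feature,
    or skip (zero out) the channel. Names are hypothetical.
    """

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # Policy head: pooled current + past features -> 3 logits per channel.
        self.policy = nn.Sequential(
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels * 3),
        )

    def forward(self, curr: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # curr, past: (N, C, H, W) feature maps from consecutive frames.
        n, c, _, _ = curr.shape
        pooled = torch.cat(
            [curr.mean(dim=(2, 3)), past.mean(dim=(2, 3))], dim=1
        )
        logits = self.policy(pooled).view(n, c, 3)
        # Gumbel-softmax yields differentiable, near-one-hot decisions at
        # train time; at inference, hard argmax decisions would let the
        # network skip computing pruned channels, realizing the savings.
        decision = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
        keep = decision[..., 0].view(n, c, 1, 1)
        reuse = decision[..., 1].view(n, c, 1, 1)
        # The third action (skip) contributes zeros, so it needs no term.
        return keep * curr + reuse * past
```

In this reading, the computation savings come from the "reuse" and "skip" branches: channels routed to them do not need to be recomputed for the current frame, while the policy head itself adds only negligible cost.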