While transformers have shown great potential on video recognition tasks thanks to their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by the self-attention operation over the huge number of 3D tokens in a video. In this paper, we propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition. Specifically, DualFormer stratifies the full space-time attention into dual cascaded levels: it first learns fine-grained local space-time interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between each query token and global pyramid contexts. Different from existing methods that apply space-time factorization or restrict attention computation within local windows to improve efficiency, our local-global stratified strategy captures both short- and long-range spatiotemporal dependencies well, while greatly reducing the number of keys and values in the attention computation to boost efficiency. Experimental results show the superiority of DualFormer over existing methods on five video benchmarks. In particular, DualFormer sets a new state of the art of 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs, at least 3.2 times fewer than existing methods with comparable performance.
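To make the two-level attention concrete, below is a minimal PyTorch sketch of a local-global stratified attention layer. It is not the official DualFormer implementation; the window size, pyramid pooling scales, and module names are illustrative assumptions. Level 1 runs self-attention inside non-overlapping 3D windows (fine-grained, local); level 2 lets every token attend to a small set of pooled pyramid context tokens, so the key/value count in the global step stays small.

```python
# Minimal sketch of local-global stratified space-time attention (illustrative,
# not the official DualFormer code). Assumes PyTorch; the window size and
# pyramid pooling scales below are hypothetical choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGlobalAttention(nn.Module):
    def __init__(self, dim, num_heads=4, window=(2, 4, 4), pool_sizes=(1, 2, 4)):
        super().__init__()
        self.window = window          # local 3D window size (T, H, W)
        self.pool_sizes = pool_sizes  # pyramid scales for global context tokens
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, T, H, W) feature map; T, H, W must be divisible by the window.
        B, C, T, H, W = x.shape
        wt, wh, ww = self.window

        # Level 1: fine-grained local attention inside non-overlapping 3D windows.
        xw = x.reshape(B, C, T // wt, wt, H // wh, wh, W // ww, ww)
        xw = xw.permute(0, 2, 4, 6, 3, 5, 7, 1)      # (B, nT, nH, nW, wt, wh, ww, C)
        xw = xw.reshape(-1, wt * wh * ww, C)          # each window is one sequence
        xw, _ = self.local_attn(xw, xw, xw)
        xw = xw.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        x = xw.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(B, C, T, H, W)

        # Level 2: coarse-grained global attention to pooled pyramid contexts.
        # Each scale s yields a small grid of context tokens; together they form
        # a compact key/value set shared by all query tokens.
        ctx = [F.adaptive_avg_pool3d(x, (max(T // s, 1), s, s)) for s in self.pool_sizes]
        ctx = torch.cat([c.flatten(2) for c in ctx], dim=2).transpose(1, 2)  # (B, K, C)
        q = x.flatten(2).transpose(1, 2)                                     # (B, T*H*W, C)
        out, _ = self.global_attn(q, ctx, ctx)
        return out.transpose(1, 2).reshape(B, C, T, H, W)


# Example: a clip-level feature map of 8 frames at 16x16 spatial resolution.
x = torch.randn(2, 64, 8, 16, 16)
y = LocalGlobalAttention(dim=64)(x)
print(y.shape)  # torch.Size([2, 64, 8, 16, 16])
```

Note how the global step attends to only K pooled context tokens (56 in this example) instead of all T*H*W tokens, which is the source of the efficiency gain described above; the exact pyramid construction in DualFormer may differ.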