While transformers have shown great potential on video recognition with their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention over the huge number of 3D tokens. In this paper, we present a new transformer architecture, termed DualFormer, which can efficiently perform space-time attention for video recognition. Concretely, DualFormer stratifies the full space-time attention into dual cascaded levels: it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Different from existing methods that apply space-time factorization or restrict attention computation to local windows for efficiency, our local-global stratification strategy captures both short- and long-range spatiotemporal dependencies while greatly reducing the number of keys and values in the attention computation. Experimental results verify the superiority of DualFormer over existing methods on five video benchmarks. In particular, DualFormer achieves 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with about 1000G inference FLOPs, at least 3.2x fewer than existing methods with comparable performance. We have released the source code at https://github.com/sail-sg/dualformer.
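To make the local-global stratification concrete, the following is a minimal PyTorch sketch of the two cascaded attention levels: fine-grained attention within non-overlapping 3D windows, followed by coarse-grained attention from every token to a pooled global context. It is an illustrative sketch, not the released implementation; the module name, window size, and the single average-pooling step (standing in for the paper's multi-scale pyramid contexts) are all assumptions.

```python
# Illustrative sketch of dual cascaded attention (NOT the authors' code).
# Window size, pooling factors, and shapes below are assumptions.
import torch
import torch.nn as nn


class DualAttentionSketch(nn.Module):
    """Level 1: local attention within 3D windows of nearby tokens.
    Level 2: global attention from each token to a pooled coarse context."""

    def __init__(self, dim, num_heads=4, window=(2, 7, 7), pool=(2, 2, 2)):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window            # local 3D window (T, H, W), assumed size
        self.pool = nn.AvgPool3d(pool)  # coarse global context via pooling (assumed scheme)

    def forward(self, x):
        # x: (B, C, T, H, W) video feature map; dims assumed divisible by window/pool sizes
        B, C, T, H, W = x.shape
        wt, wh, ww = self.window
        nT, nH, nW = T // wt, H // wh, W // ww

        # ---- Level 1: fine-grained local attention within each 3D window ----
        xw = x.reshape(B, C, nT, wt, nH, wh, nW, ww)
        xw = xw.permute(0, 2, 4, 6, 3, 5, 7, 1)       # (B, nT, nH, nW, wt, wh, ww, C)
        xw = xw.reshape(-1, wt * wh * ww, C)          # (B * numWindows, windowTokens, C)
        xw, _ = self.local_attn(xw, xw, xw)           # keys/values limited to the window
        xw = xw.reshape(B, nT, nH, nW, wt, wh, ww, C)
        xw = xw.permute(0, 7, 1, 4, 2, 5, 3, 6).reshape(B, C, T, H, W)

        # ---- Level 2: coarse-grained global attention to pooled contexts ----
        q = xw.flatten(2).transpose(1, 2)             # (B, T*H*W, C) queries
        kv = self.pool(xw).flatten(2).transpose(1, 2) # far fewer keys/values after pooling
        out, _ = self.global_attn(q, kv, kv)          # every query sees the global context
        return out.transpose(1, 2).reshape(B, C, T, H, W)


# Usage on a tiny feature map whose dimensions divide the window/pool sizes.
x = torch.randn(1, 64, 4, 14, 14)
y = DualAttentionSketch(dim=64)(x)
print(y.shape)  # torch.Size([1, 64, 4, 14, 14])
```

The efficiency argument is visible in the tensor shapes: the local level restricts each token's keys and values to its own window, and the global level shrinks the key/value set by pooling, so neither level attends over the full set of T*H*W 3D tokens.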