Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both the long-term dynamics and short-term changes of a video. Unfortunately, in most existing methods the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address it, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernels: a box kernel and a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame and requires only a constant-time update per frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, which takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than an equivalent sliding-window-based transformer with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchens-100, two standard online action detection and action anticipation benchmarks. A real-time version of TeSTra outperforms all but one prior approach on the THUMOS'14 dataset.
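To make the constant-time update concrete, below is a minimal sketch of streaming cross-attention under the Laplace (exponential-decay) kernel. It assumes fixed learned queries and a scalar decay factor; the function name `streaming_laplace_attention`, the parameter `lam`, and the omission of numerical-stability tricks are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def streaming_laplace_attention(queries, keys, values, lam=0.99):
    """Sketch: cross-attention smoothed by a Laplace kernel, O(1) per frame.

    queries: (nq, d)   fixed learned queries (assumed constant over time)
    keys:    (T, d)    per-frame key features arriving as a stream
    values:  (T, dv)   per-frame value features arriving as a stream
    lam:     decay factor of the Laplace kernel
    """
    nq, d = queries.shape
    num = np.zeros((nq, values.shape[1]))  # running sum of lam^(t-i) * exp(q.k_i) * v_i
    den = np.zeros((nq, 1))                # running sum of lam^(t-i) * exp(q.k_i)
    outputs = []
    for k_t, v_t in zip(keys, values):
        w = np.exp(queries @ k_t / np.sqrt(d))[:, None]  # (nq, 1) unnormalized weights
        num = lam * num + w * v_t[None, :]               # decay old state, add new frame
        den = lam * den + w
        outputs.append(num / den)                        # normalized attention readout
    return np.stack(outputs)                             # (T, nq, dv)
```

Because the exponential decay folds the entire history into two running statistics, each new frame costs the same regardless of how long the stream is; a box kernel would instead keep a fixed-length queue and subtract the contribution of the frame that falls out of the window.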