We introduce the Video Transformer (VidTr) with separable attention for video classification. Compared with commonly used 3D networks, VidTr aggregates spatio-temporal information via stacked attentions and provides better performance with higher efficiency. We first introduce the vanilla video transformer and show that the transformer module is able to perform spatio-temporal modeling from raw pixels, but with heavy memory usage. We then present VidTr, which reduces the memory cost by 3.3$\times$ while keeping the same performance. To further optimize the model, we propose standard-deviation-based topK pooling for attention ($pool_{topK\_std}$), which reduces computation by dropping non-informative features along the temporal dimension. VidTr achieves state-of-the-art performance on five commonly used datasets with lower computational requirements, demonstrating both the efficiency and effectiveness of our design. Finally, error analysis and visualization show that VidTr is especially good at predicting actions that require long-term temporal reasoning.
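To make the pooling idea concrete, the sketch below illustrates standard-deviation-based topK pooling along the temporal dimension: each temporal position is scored by how much its values vary, and only the $k$ highest-scoring positions are kept. The function name `topk_std_pool`, the tensor shapes, and the choice to score feature vectors directly (rather than the attention matrix itself) are illustrative assumptions, not the exact VidTr implementation.

```python
import torch

def topk_std_pool(x: torch.Tensor, k: int) -> torch.Tensor:
    """Minimal sketch of std-based topK pooling over the temporal axis.

    x: tensor of shape (batch, time, channels).
    Returns the k temporal positions with the highest per-position
    standard deviation, kept in their original temporal order.
    """
    # Informativeness score per temporal position: std over channels.
    scores = x.std(dim=-1)                                   # (batch, time)
    # Indices of the k highest-std positions, re-sorted to preserve order.
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # (batch, k)
    # Gather the selected temporal positions from the feature tensor.
    return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

# Example: pool 16 temporal tokens down to 8.
feats = torch.randn(2, 16, 768)
pooled = topk_std_pool(feats, k=8)   # shape: (2, 8, 768)
```

The intuition is that temporal positions with low variance carry little discriminative information (e.g., near-duplicate frames) and can be dropped with minimal loss in accuracy, which is where the computational savings come from.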