Pure vision transformer architectures are highly effective for short-video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource intensive and suffer from data inefficiencies. Long-form video understanding tasks amplify the data and memory efficiency problems of transformers, making current approaches infeasible to deploy in data- or memory-constrained domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN), which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
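To make the two-stream idea concrete, the sketch below is a minimal PyTorch illustration, not the authors' implementation: it assumes a hypothetical `TwoStreamBlock` in which a spatial stream (per-frame image features) and a temporal stream (contextual features aggregated across frames) exchange information through cross-attention. Layer sizes, token counts, and the fusion scheme are illustrative assumptions; STAN's actual architecture may differ.

```python
# Hypothetical sketch of a two-stream spatio-temporal attention block (not STAN's code).
import torch
import torch.nn as nn


class TwoStreamBlock(nn.Module):
    """Spatial and temporal streams exchange information via cross-attention,
    then each stream passes through its own MLP (pre-norm residual layout)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        # Spatial queries attend to temporal keys/values, and vice versa.
        self.cross_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp_s = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_t = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor):
        # spatial:  (B, N_s, dim) static per-frame image features
        # temporal: (B, N_t, dim) temporal contextual features
        s, t = self.norm_s(spatial), self.norm_t(temporal)
        spatial = spatial + self.cross_s(s, t, t, need_weights=False)[0]
        temporal = temporal + self.cross_t(t, s, s, need_weights=False)[0]
        spatial = spatial + self.mlp_s(self.norm_s(spatial))
        temporal = temporal + self.mlp_t(self.norm_t(temporal))
        return spatial, temporal


if __name__ == "__main__":
    block = TwoStreamBlock()
    frames = torch.randn(2, 196, 768)   # e.g. patch tokens from one sampled frame
    context = torch.randn(2, 64, 768)   # e.g. pooled temporal tokens over the clip
    out_s, out_t = block(frames, context)
    print(out_s.shape, out_t.shape)
```

Keeping the two streams separate and coupling them only through cross-attention is one way such a design can avoid full joint space-time self-attention, whose cost grows quadratically in the total number of spatio-temporal tokens.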