Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in the temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information from extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning at significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communications between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, at a fraction of the computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
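To make the two sparsity mechanisms concrete, the following is a minimal NumPy sketch, not the SViTT implementation: edge sparsity is illustrated as a binary mask restricting which query-key pairs participate in self-attention, and node sparsity as top-k pruning of visual tokens by a saliency score. All function names, the local-window mask, and the norm-based saliency score are illustrative assumptions.

```python
# Minimal sketch of edge and node sparsity (illustrative only, not SViTT's code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_sparse_attention(q, k, v, mask):
    """Self-attention where mask[i, j] = True lets query i attend to key j.
    A sparse mask skips most query-key communications (edge sparsity)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (N, N) attention logits
    scores = np.where(mask, scores, -1e9)      # drop pruned edges
    return softmax(scores, axis=-1) @ v

def node_sparsity(tokens, scores, keep_ratio=0.5):
    """Keep only the top-k highest-scoring visual tokens (node sparsity)."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]             # indices of the k largest scores
    return tokens[np.sort(keep)]               # preserve original token order

# Toy usage: 8 visual tokens of dimension 16 with a local-window edge mask.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
mask = np.abs(np.arange(8)[:, None] - np.arange(8)[None, :]) <= 2
out = edge_sparse_attention(x, x, x, mask)
pruned = node_sparsity(out, scores=np.linalg.norm(out, axis=-1), keep_ratio=0.5)
print(out.shape, pruned.shape)                 # (8, 16) (4, 16)
```

In this sketch, edge sparsity reduces the number of attended query-key pairs per token, while node sparsity shrinks the token set itself; combining the two is what allows multi-frame reasoning at a fraction of the dense-attention cost.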