Autoregressive transformers have shown remarkable success in video generation. However, they are prohibited from directly learning long-term dependencies in videos due to the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation because of the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting observable context tokens into a fixed number of latent tokens and conditioning on them to decode the masked tokens through cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvements over autoregressive transformers in both quality and speed when generating moderately long videos.
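To make the latent-bottleneck idea concrete, the sketch below shows one way the two cross-attention steps could look in PyTorch: a fixed set of latent tokens first attends to the observable context tokens, and the masked-token queries then attend to those latents, so neither step scales quadratically with the number of video tokens. This is a minimal illustration under assumed names and sizes (`LatentBottleneckDecoder`, `num_latents`, `dim`), not the paper's actual architecture, which stacks full transformer blocks with feed-forward layers and residual connections.

```python
import torch
import torch.nn as nn

class LatentBottleneckDecoder(nn.Module):
    """Illustrative sketch (not the paper's implementation): a fixed number of
    latent tokens gathers information from the observable context via
    cross-attention, and masked-token queries decode by cross-attending to
    that constant-size latent set, giving cost linear in the token count."""

    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.encode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.decode_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context_tokens, masked_queries):
        # context_tokens: (B, N_ctx, dim)  -- embeddings of observed patches
        # masked_queries: (B, N_mask, dim) -- positional queries for masked patches
        b = context_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Encoding: latents attend to the observable context (linear in N_ctx).
        latents, _ = self.encode_attn(latents, context_tokens, context_tokens)
        # Decoding: masked-token queries attend to the fixed-size latent set
        # (linear in N_mask).
        decoded, _ = self.decode_attn(masked_queries, latents, latents)
        return decoded  # (B, N_mask, dim), e.g. fed to a token-prediction head


# Usage: 2 videos, 1024 observed tokens, 512 masked tokens, 256-dim embeddings.
model = LatentBottleneckDecoder()
ctx = torch.randn(2, 1024, 256)
queries = torch.randn(2, 512, 256)
out = model(ctx, queries)
print(out.shape)  # torch.Size([2, 512, 256])
```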