Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention without retraining at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM, and evaluated on VBench, VBench2.0, and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos. The project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.
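
To make the idea of mixing softmax and linear tokens concrete, the following is a minimal, generic sketch of per-token hybrid attention: a subset of query tokens is routed through exact softmax attention while the rest use a kernelized linear attention. The ELU+1 feature map, the function names, and the `softmax_token_idx` selection input are illustrative assumptions for this sketch, not the exact formulation used in Attention Surgery.

```python
# Illustrative sketch of hybrid softmax/linear attention (not the paper's exact method).
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention using an ELU+1 feature map (a common choice, assumed here).
    q, k: (B, H, N, D), v: (B, H, N, E)."""
    q = F.elu(q) + 1.0                                   # non-negative query features
    k = F.elu(k) + 1.0                                   # non-negative key features
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # sum_n phi(k_n) v_n^T
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps  # per-token normalizer
    return torch.einsum("bhnd,bhde->bhne", q, kv) / z.unsqueeze(-1)


def hybrid_attention(q, k, v, softmax_token_idx):
    """Route selected query tokens through exact softmax attention and the rest
    through linear attention. `softmax_token_idx` is a hypothetical per-layer
    index set; in the paper this split is governed by the cost-aware strategy."""
    out = linear_attention(q, k, v)
    q_s = q[:, :, softmax_token_idx]                     # (B, H, M, D) selected queries
    attn = torch.softmax(q_s @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    out[:, :, softmax_token_idx] = attn @ v              # overwrite with exact attention
    return out
```

In such a split, the quadratic cost is confined to the softmax-attended subset while the remaining tokens scale linearly with sequence length, which is consistent with the improved latency, memory, and FLOP scaling for longer videos reported above.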

