Frame-level autoregressive (frame-AR) models have made significant progress, enabling real-time video generation with quality comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches to long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining the full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames from this compressed representation. Furthermore, we introduce MAG-Bench to rigorously evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
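To make the decoupling concrete, below is a minimal sketch of the memorize-and-generate pattern: a memory model that compresses an arbitrarily long frame-token history into a fixed number of KV slots, and a generator that cross-attends to that compact cache when predicting the next frame. This is not the paper's implementation; the class names, dimensions, slot-attention compression scheme, and per-step recompression are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def attend(q, k, v):
    # Plain scaled dot-product attention over batched sequences.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return scores.softmax(dim=-1) @ v

class MemoryModel(nn.Module):
    # Hypothetical: compresses the full frame-token history into a fixed
    # number of KV slots, so memory cost is independent of video length.
    def __init__(self, dim=256, num_slots=16):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, history_tokens):
        # history_tokens: (B, T, dim); T grows with the number of past frames.
        q = self.slots.expand(history_tokens.size(0), -1, -1)
        summary = attend(q, history_tokens, history_tokens)  # (B, S, dim)
        return self.to_kv(summary).chunk(2, dim=-1)          # compact K, V

class Generator(nn.Module):
    # Hypothetical: predicts the next frame's tokens, conditioning on the
    # current frame and the compressed memory via cross-attention.
    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, frame_tokens, mem_k, mem_v):
        q, k, v = self.qkv(frame_tokens).chunk(3, dim=-1)
        x = attend(q, k, v)              # self-attention within the frame
        x = x + attend(q, mem_k, mem_v)  # cross-attention to compressed cache
        return self.out(x)

# Toy rollout: recompress the history each step for clarity; a real system
# would update the cache incrementally.
mem, gen = MemoryModel(), Generator()
frames = [torch.randn(1, 64, 256)]        # tokens of one conditioning frame
for _ in range(4):
    k, v = mem(torch.cat(frames, dim=1))  # fixed-size memory of all history
    frames.append(gen(frames[-1], k, v))  # next frame from compressed memory
```

The design choice this sketch illustrates is that the generator never sees raw history: whatever survives compression into the slot cache is all it can condition on, which is what makes a dedicated benchmark for memory retention (MAG-Bench) meaningful.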