The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on memory design. Most existing solutions maintain memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which fixed strategies struggle to satisfy. In this work, we propose MemFlow to address this problem. Specifically, before generating the upcoming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the text prompt of that chunk. This design preserves narrative coherence even when new events occur or the scene changes in future frames. In addition, during generation we activate only the memory-bank tokens most relevant to each query in the attention layers, which effectively preserves generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computational overhead (a 7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model that uses a KV cache.
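The two mechanisms described above, prompt-conditioned retrieval of historical frames into the memory bank and top-k token activation inside the attention layers, can be sketched roughly as follows. This is a minimal illustrative sketch under assumed interfaces, not the authors' implementation; the function names, tensor shapes, cosine-similarity retrieval, and top-k masking choices are all assumptions for illustration.

# Minimal sketch (assumed shapes and helpers, not the MemFlow implementation).
import torch
import torch.nn.functional as F

def retrieve_memory(prompt_emb, frame_embs, frame_kv, num_frames):
    """Select cached KV entries of the historical frames most similar to the
    upcoming chunk's text prompt.

    prompt_emb: (d,)            text embedding of the next chunk's prompt
    frame_embs: (F, d)          per-frame embeddings of historical frames
    frame_kv:   (F, T, 2, d)    cached key/value tokens per frame (T tokens each)
    """
    sims = F.cosine_similarity(frame_embs, prompt_emb.unsqueeze(0), dim=-1)   # (F,)
    top = sims.topk(min(num_frames, frame_embs.shape[0])).indices             # (num_frames,)
    return frame_kv[top].flatten(0, 1)                                        # (num_frames*T, 2, d)

def sparse_memory_attention(q, mem_kv, k_active):
    """Attend only to the k_active memory tokens most relevant to each query.

    q:      (Q, d)      queries of the chunk being generated
    mem_kv: (M, 2, d)   retrieved memory key/value tokens
    """
    k, v = mem_kv[:, 0], mem_kv[:, 1]                       # (M, d) each
    scores = q @ k.t() / k.shape[-1] ** 0.5                  # (Q, M) scaled dot products
    idx = scores.topk(min(k_active, k.shape[0]), dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).scatter(1, idx, 0.0)
    attn = torch.softmax(scores + mask, dim=-1)               # zero weight outside the top-k
    return attn @ v                                            # (Q, d)

In such a setup, only the retrieved frames' tokens enter the attention computation, and only the top-k of those receive nonzero weight per query, which is one way the reported efficiency could be preserved while the memory adapts to each chunk's prompt.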