Recent approaches have demonstrated the promise of diffusion models for generating interactive, explorable worlds. However, most of these methods suffer from excessively large parameter counts, lengthy inference procedures, and rapidly growing historical context, which severely limit real-time performance; moreover, they lack text-controlled generation capabilities. To address these challenges, we propose \method, a novel framework that generates realistic, interactive, and continuous worlds from a single image or text prompt and supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation pipeline that integrates unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. The codebase is provided in the supplementary material.
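As an illustration of the linear-attention ingredient named in component (1), the sketch below shows a generic kernelized linear attention layer (in the spirit of Katharopoulos et al.), whose cost grows linearly rather than quadratically with sequence length. The function name, feature map, and tensor shapes are assumptions for illustration only and do not reflect \method's actual implementation.
\begin{verbatim}
import torch

def linear_attention(q, k, v, eps=1e-6):
    # Illustrative sketch only, not the paper's implementation.
    # Kernelized attention with feature map phi(x) = elu(x) + 1,
    # giving O(N) cost in the sequence length N.
    q = torch.nn.functional.elu(q) + 1.0   # (B, H, N, D)
    k = torch.nn.functional.elu(k) + 1.0   # (B, H, N, D)
    # Aggregate key/value statistics once instead of forming an N x N map.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)   # (B, H, D, D)
    k_sum = k.sum(dim=2)                         # (B, H, D)
    # Each query attends through the aggregated statistics.
    num = torch.einsum("bhnd,bhde->bhne", q, kv)
    den = torch.einsum("bhnd,bhd->bhn", q, k_sum).unsqueeze(-1) + eps
    return num / den

# Toy usage: batch 1, 4 heads, 16 tokens, head dim 32.
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)
out = linear_attention(q, k, v)   # shape (1, 4, 16, 32)
\end{verbatim}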