Video diffusion models have revolutionized generative video synthesis, but they remain imprecise, slow, and opaque during generation, keeping users in the dark for prolonged periods. In this work, we propose DiffusionBrowser, a model-agnostic, lightweight decoder framework that allows users to interactively generate previews at any point (timestep or transformer block) during the denoising process. Our decoder produces multi-modal preview representations, including RGB and scene intrinsics, at more than 4$\times$ real-time speed (under 1 second for a 4-second video), and these previews convey appearance and motion consistent with the final video. With the trained decoder, we show that it is possible to interactively guide the generation at intermediate noise steps via stochasticity reinjection and modal steering, unlocking a new control capability. Moreover, we systematically probe the model using the learned decoders, revealing how scene, object, and other details are composed and assembled during the otherwise black-box denoising process.
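To make the decoding idea concrete, the sketch below illustrates one plausible form of a lightweight preview decoder attached to intermediate denoiser latents, plus a toy stochasticity-reinjection step. This is not the authors' implementation: the module names (`PreviewDecoder`, `reinject_stochasticity`), layer choices, latent shape, and noise schedule are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): a tiny decoder head that maps an
# intermediate video-diffusion latent to coarse multi-modal previews (RGB + a scene
# intrinsic such as depth), and a toy stochasticity-reinjection step for resteering.
import torch
import torch.nn as nn


class PreviewDecoder(nn.Module):
    """Lightweight convolutional head over intermediate denoiser activations (assumed design)."""

    def __init__(self, latent_ch: int = 16, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv3d(latent_ch, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        self.rgb_head = nn.Conv3d(hidden, 3, kernel_size=1)    # RGB preview
        self.depth_head = nn.Conv3d(hidden, 1, kernel_size=1)  # scene-intrinsic preview

    def forward(self, latent: torch.Tensor) -> dict:
        h = self.trunk(latent)
        return {"rgb": torch.sigmoid(self.rgb_head(h)), "depth": self.depth_head(h)}


def reinject_stochasticity(latent: torch.Tensor, sigma: float) -> torch.Tensor:
    """Illustrative 'stochasticity reinjection': perturb a partially denoised latent
    so the remaining denoising steps can explore an alternative sample."""
    return latent + sigma * torch.randn_like(latent)


if __name__ == "__main__":
    decoder = PreviewDecoder()
    # Assumed latent layout: (batch, channels, frames, height, width).
    latent = torch.randn(1, 16, 8, 32, 32)
    previews = decoder(latent)  # fast preview at an intermediate timestep or block
    print({k: tuple(v.shape) for k, v in previews.items()})
    latent = reinject_stochasticity(latent, sigma=0.3)  # user-driven resteering
```

In a sketch like this, the decoder is trained separately from the frozen diffusion backbone, so previews can be requested at any timestep or block without slowing the main denoising loop; the exact training signal and attachment points are described in the paper body.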