Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS, while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight is that future video blocks do not need fully denoised predecessors to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades in which multiple blocks denoise simultaneously. With 5 GPUs exploiting this temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, and 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates the ~200 ms KV-recaching overhead incurred at context switches during interactive generation. Extensive evaluations across multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
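To make the cascade concrete, below is a minimal Python sketch of the staggered schedule described above. It is illustrative only: the `denoise_step` callable and the `lag` parameter are hypothetical names, and the sequential inner loop stands in for what would, in the actual pipeline, be one concurrent wavefront with one block per GPU.

```python
# Minimal sketch of a block-cascaded denoising schedule (assumptions:
# `denoise_step(block, context, step)` is a user-supplied callable, and
# `lag` is the number of steps a block trails its predecessor by).

def block_cascading(blocks, denoise_step, num_steps, lag=1):
    """Stagger denoising so block b runs `lag` steps behind block b-1.

    Each block conditions on the *partially* denoised state of its
    predecessor instead of waiting for it to finish, so up to
    len(blocks) blocks can denoise simultaneously.
    """
    steps_done = [0] * len(blocks)
    while any(s < num_steps for s in steps_done):
        snapshot = list(steps_done)  # freeze one parallel wavefront
        for b in range(len(blocks)):
            if snapshot[b] >= num_steps:
                continue  # block b is already fully denoised
            # Block 0 has no predecessor; later blocks advance once the
            # predecessor is `lag` steps ahead (or fully finished).
            ready = (
                b == 0
                or snapshot[b - 1] >= min(snapshot[b] + lag, num_steps)
            )
            if ready:
                context = blocks[b - 1] if b > 0 else None
                blocks[b] = denoise_step(blocks[b], context, steps_done[b])
                steps_done[b] += 1
    return blocks


# Toy usage: blocks are floats and "denoising" just averages in context.
if __name__ == "__main__":
    dummy = lambda block, ctx, step: block if ctx is None else 0.5 * (block + ctx)
    print(block_cascading([1.0, 2.0, 3.0], dummy, num_steps=4))
```

With `lag=1` and three blocks of four steps each, this schedule finishes in six wavefronts instead of the twelve a fully sequential pipeline would need, which is the source of the roughly 2x speedup reported above.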