High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality with up to 76.2x faster denoising than the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii) and achieves a 107.5x speedup over the baseline, offering a compelling speed-quality trade-off and making high-resolution video generation practical and scalable.
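The interplay of the three axes can be illustrated with a minimal sketch. The structure below is an assumption for illustration only: the cache size, step counts, and function names (`denoise`, `generate`) are hypothetical and not taken from the paper; the actual denoiser is a diffusion model, stubbed out here. The sketch shows axis (ii), a fixed-size anchor cache that bounds per-chunk conditioning cost, and axis (iii), reduced denoising steps for cache-conditioned chunks.

```python
from collections import deque

# Hypothetical constants for illustration; the abstract does not specify them.
ANCHOR_CACHE_SIZE = 4   # fixed-size anchor cache (axis ii) -> stable speed
FULL_STEPS = 50         # full denoising schedule for the first chunk
REDUCED_STEPS = 20      # fewer steps for cache-conditioned chunks (axis iii)

def denoise(chunk_id, steps, cache):
    """Stub for the diffusion denoiser. Axis (i) would live here:
    denoise at low resolution first, then refine at high resolution
    while reusing cached features. We only record the call's inputs."""
    return {"chunk": chunk_id, "steps": steps, "anchors": list(cache)}

def generate(num_chunks):
    # deque(maxlen=N) evicts the oldest anchor automatically, so the
    # conditioning context never grows with video length.
    anchor_cache = deque(maxlen=ANCHOR_CACHE_SIZE)
    outputs = []
    for i in range(num_chunks):
        steps = FULL_STEPS if i == 0 else REDUCED_STEPS
        outputs.append(denoise(i, steps, anchor_cache))
        anchor_cache.append(i)  # cache this chunk as an anchor
    return outputs

outs = generate(6)
```

Because the cache is bounded, chunk 5 is conditioned on exactly the four most recent anchors regardless of how long the video grows, which is what keeps the autoregressive inference speed constant per chunk.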