High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy along three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality with up to 76.2x faster denoising than the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii) and achieves a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality and making high-resolution video generation both practical and scalable.
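The chunk-by-chunk schedule described above can be sketched as follows. This is a minimal illustrative sketch, not the HiStream implementation: the function name, step counts, and cache size are all hypothetical, and the real system caches denoised features rather than chunk indices. It shows the two temporal/timestep ideas from the abstract: a fixed-size anchor cache (so per-chunk cost stays constant) and a reduced step count for cache-conditioned chunks.

```python
from collections import deque


def plan_denoising(num_chunks, full_steps=50, cached_steps=20, cache_size=2):
    """Hypothetical schedule: the first chunk gets the full step count;
    later chunks condition on a fixed-size anchor cache and use fewer
    denoising steps (all names and values are illustrative)."""
    # maxlen keeps the cache at a fixed size, so inference speed
    # does not degrade as the video grows longer.
    anchors = deque(maxlen=cache_size)
    schedule = []
    for i in range(num_chunks):
        # Cache-conditioned chunks need fewer denoising steps.
        steps = full_steps if not anchors else cached_steps
        schedule.append({"chunk": i, "steps": steps, "anchors": list(anchors)})
        # In a real system this would store the chunk's denoised features.
        anchors.append(i)
    return schedule


plan = plan_denoising(num_chunks=4)
```

With these illustrative numbers, chunk 0 is denoised with 50 steps and no anchors, while chunks 1-3 each use 20 steps conditioned on at most the two most recent chunks, so the working set per chunk is bounded regardless of video length.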