Existing diffusion codecs typically build on text-to-image diffusion foundation models such as Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, limiting the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address this, we introduce \textbf{CoD}, the first \textbf{Co}mpression-oriented \textbf{D}iffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. CoD is not a fixed codec but a general foundation model designed to support a variety of diffusion-based codecs. It offers several advantages: \textbf{High compression efficiency}: replacing Stable Diffusion with CoD in downstream codecs such as DiffC achieves SOTA results, especially at ultra-low bitrates (e.g., 0.0039 bpp); \textbf{Low-cost, reproducible training}: 300$\times$ faster training than Stable Diffusion ($\sim$20 vs. $\sim$6,250 A100 GPU days) on entirely open image-only datasets; \textbf{New insights}: e.g., we find that pixel-space diffusion can achieve VTM-level PSNR with high perceptual quality and can outperform GAN-based codecs while using fewer parameters. We hope CoD lays the foundation for future diffusion codec research. Code will be released.