Denoising diffusion models have recently marked a milestone in high-quality image generation. One may thus wonder whether they are suitable for neural image compression. This paper outlines an end-to-end optimized image compression framework based on a conditional diffusion model, drawing on the transform-coding paradigm. Besides the latent variables inherent to the diffusion process, this paper introduces an additional discrete "content" latent variable on which the denoising process is conditioned. This variable is equipped with a hierarchical prior for entropy coding. The remaining "texture" latent variables characterizing the diffusion process are synthesized (either stochastically or deterministically) at decoding time. We furthermore show that performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and 16 image perceptual quality assessment metrics show that our approach not only compares favorably in terms of rate and perceptual distortion tradeoffs but also performs robustly across all metrics, whereas other baselines behave less consistently.
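To make the decoding pipeline concrete, the following is a minimal sketch of deterministic (DDIM-style) decoding conditioned on a transmitted content latent. All names (`ddim_decode`, `eps_theta`, `z_content`, `alphas_bar`) are hypothetical placeholders, not the paper's actual implementation: a trained conditional denoiser would replace `eps_theta`, and the initial "texture" noise is synthesized at the decoder from a fixed seed rather than transmitted.

```python
import numpy as np

def ddim_decode(z_content, eps_theta, alphas_bar, shape, seed=0):
    """Deterministic decoding sketch: start from decoder-synthesized noise
    ("texture" latent) and iteratively denoise, conditioning every step on
    the entropy-decoded "content" latent z_content.

    alphas_bar: cumulative noise schedule, alphas_bar[0] ~ 1, decreasing in t.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # x_T: synthesized at decode time, not sent
    for t in range(len(alphas_bar) - 1, 0, -1):
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        eps = eps_theta(x, t, z_content)                   # conditional noise estimate
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean image
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # DDIM update (eta = 0)
    return x
```

Because the update is deterministic given the seed, the decoder reproduces the same reconstruction from the same bitstream; a stochastic variant would instead inject fresh noise at each step.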