This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Extensive experiments on five datasets and sixteen image quality assessment metrics show that our approach achieves the strongest reported FID scores while remaining competitive with state-of-the-art models on several SIM-based reference metrics.
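To make the decoding path concrete, the following is a minimal sketch of a conditional reverse diffusion process of the kind described above, assuming a DDIM-style sampler; the function `diffusion_decode`, the denoiser signature `eps_model(x_t, t, z)`, and the schedule tensor `alpha_bar` are illustrative assumptions, not the authors' implementation. Setting `eta=0` synthesizes the texture variables deterministically, while `eta>0` samples them stochastically, mirroring the two decoding modes mentioned in the abstract.

```python
# Hypothetical sketch (not the paper's code): a quantized "content" latent z
# conditions the reverse diffusion process, while the "texture" variables
# x_T, ..., x_1 are synthesized at decoding time.
import torch

@torch.no_grad()
def diffusion_decode(eps_model, z, shape, alpha_bar, eta=0.0):
    """Reconstruct an image from the content latent `z` (assumed API).

    eps_model(x_t, t, z): predicted noise, conditioned on the content latent.
    alpha_bar: 1-D tensor of cumulative noise-schedule products, one per step.
    eta=0.0 gives deterministic (DDIM-style) texture synthesis; eta>0
    samples the texture variables stochastically.
    """
    x = torch.randn(shape)  # x_T: initial texture variable
    T = len(alpha_bar)
    for t in reversed(range(T)):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(x, torch.tensor([t]), z)          # conditional noise estimate
        x0 = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # predicted clean image
        # DDIM variance; sigma == 0 when eta == 0 (deterministic decoding)
        sigma = eta * ((1 - ab_prev) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_prev).sqrt()
        dir_xt = (1 - ab_prev - sigma**2).clamp(min=0).sqrt() * eps
        noise = sigma * torch.randn_like(x) if t > 0 else 0.0
        x = ab_prev.sqrt() * x0 + dir_xt + noise
    return x  # reconstruction x_0
```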