Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, given the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, as the noise strength increases, we are able to diminish those inconsequential components and thus use a lower-dimensional signal to represent the source with barely any loss of information. Such a reformulation allows us to vary dimensions in both the training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves the FID of a diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and models will be made publicly available.
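The core idea above — projecting an image onto orthogonal components and attenuating each one differently as noise is added, so that heavily attenuated components can be dropped to lower the signal dimension — can be sketched as follows. This is a minimal illustration on a 1-D signal, not the paper's actual formulation: the DCT basis, the exponential decay schedule, the toy noise schedule, and the `keep_frac` parameter are all assumptions introduced here for clarity.

```python
import numpy as np

def orthonormal_dct_basis(n):
    """Rows are the orthonormal DCT-II basis vectors (one choice of
    orthogonal decomposition; the paper's decomposition may differ)."""
    j = np.arange(n)
    f = np.arange(n)
    basis = np.cos(np.pi * (2 * j[None, :] + 1) * f[:, None] / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)
    return basis * np.sqrt(2.0 / n)

def attenuated_forward_step(x0, t, T, keep_frac=0.5, rng=None):
    """Perturb x0 with noise while attenuating higher-order components
    faster, then drop the inconsequential ones so the perturbed signal
    lives in a lower dimension. Schedules here are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    n = x0.shape[0]
    B = orthonormal_dct_basis(n)
    coeffs = B @ x0                          # project onto orthogonal components
    # Per-component attenuation: decay grows with component index and t.
    decay = np.exp(-np.linspace(0.0, 3.0, n) * t / T)
    coeffs = coeffs * decay
    # Keep only the least-attenuated components -> lower-dimensional signal.
    k = max(1, int(keep_frac * n))
    low_dim = coeffs[:k]
    noise = rng.standard_normal(k)
    alpha_bar = 1.0 - t / T                  # toy linear noise schedule
    return np.sqrt(alpha_bar) * low_dim + np.sqrt(1.0 - alpha_bar) * noise
```

With `keep_frac=0.5`, the perturbed signal at step `t` has half the dimension of the source, which is where the computational savings in training and inference would come from.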