In this work, we explore an untapped signal in diffusion model inference. While previous methods generate each image independently at inference time, we instead ask whether samples can be generated collaboratively. We propose Group Diffusion, which unlocks the attention mechanism to be shared across images rather than restricted to the patches within a single image. This enables images to be jointly denoised at inference time, capturing both intra- and inter-image correspondence. We observe a clear scaling effect: larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a quantitative measure of this behavior and show that its strength closely correlates with Fréchet Inception Distance (FID). Built on standard diffusion transformers, our GroupDiff achieves up to a 32.2% FID improvement on ImageNet 256×256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
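To make the core idea concrete, the sketch below shows one way attention can be shared across a group of images: patch tokens from all images in a group are concatenated along the sequence dimension before standard multi-head attention, so each patch attends to patches of every image in its group. This is a minimal illustration under stated assumptions, not the paper's actual layer; the `GroupAttention` name, the `group_size` argument, and the reshaping scheme are all hypothetical.

```python
import torch
import torch.nn as nn

class GroupAttention(nn.Module):
    """Self-attention over all patch tokens in a group of images.

    Illustrative sketch only: the abstract describes sharing attention
    across images in a group; the exact layer design is assumed here.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, group_size: int) -> torch.Tensor:
        # x: (B, N, D) tokens for B images with N patches each.
        B, N, D = x.shape
        assert B % group_size == 0, "batch must split evenly into groups"
        G = B // group_size
        # Concatenate the tokens of all images in a group so attention
        # spans both intra- and inter-image patch pairs: (G, S*N, D).
        x = x.reshape(G, group_size * N, D)
        out, _ = self.attn(x, x, x)
        # Restore the per-image layout: (B, N, D).
        return out.reshape(B, N, D)

# Usage: a batch of 16 images, denoised jointly in groups of 4.
tokens = torch.randn(16, 256, 768)          # 16 images, 256 patches, dim 768
layer = GroupAttention(dim=768, num_heads=12)
out = layer(tokens, group_size=4)           # cross-sample attention within each group
```

With `group_size=1` this reduces to ordinary per-image self-attention, which is one way to see why larger groups give the attention mechanism strictly more context to exploit.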