Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and to maintain consistent aesthetic quality. Existing preference-based training methods such as Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals with a pretrained reference model: we contrast its outputs when conditioned on the original prompt versus a semantically degraded variant. This strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, matching or outperforming existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO
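
As a rough illustration of the idea described above (not the paper's exact objective, which is not given here), the sketch below shows how per-timestep preference signals could be formed by contrasting a frozen reference model's noise predictions under the original prompt ("winning") and a semantically degraded prompt ("losing"), and plugged into a DPO-style logistic loss. The function names, the epsilon-prediction interface, and the `beta` temperature are assumptions made for the example.

```python
# Illustrative sketch only: DDSPO's exact loss is not specified in the abstract.
# Assumes an epsilon-prediction diffusion model; `policy`, `ref_model`,
# `degraded_emb`, and `beta` are hypothetical names/parameters.
import torch
import torch.nn.functional as F

def ddspo_style_loss(policy, ref_model, x_t, t, prompt_emb, degraded_emb, beta=1.0):
    """Per-timestep preference loss contrasting the reference model's outputs
    under the original prompt ("winning") vs. a semantically degraded prompt
    ("losing"). No human labels or reward model are required."""
    with torch.no_grad():
        eps_win = ref_model(x_t, t, prompt_emb)      # reference, original prompt
        eps_lose = ref_model(x_t, t, degraded_emb)   # reference, degraded prompt

    eps_policy = policy(x_t, t, prompt_emb)          # current policy prediction

    # Score-space preference margin: at every denoising step, the policy should
    # sit closer to the winning target than to the losing one.
    d_win = F.mse_loss(eps_policy, eps_win, reduction="none").mean(dim=(1, 2, 3))
    d_lose = F.mse_loss(eps_policy, eps_lose, reduction="none").mean(dim=(1, 2, 3))

    return -F.logsigmoid(beta * (d_lose - d_win)).mean()
```

Because the comparison is made at every timestep of the denoising trajectory rather than only on final samples, this kind of objective yields the dense, transition-level supervision the abstract refers to.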