While diffusion models excel at image synthesis, useful representations have been shown to emerge from generative pre-training, suggesting a path toward unified generative and discriminative learning. However, suboptimal semantic flow within current architectures can hinder this potential: the features encoding the richest high-level semantics are underutilized and diluted as they propagate through decoding layers, impeding the formation of an explicit semantic bottleneck layer. To address this, we introduce self-conditioning, a lightweight mechanism that reshapes the model's layer-wise semantic hierarchy without external guidance. By aggregating intermediate features and rerouting them to guide subsequent decoding layers, our method concentrates high-level semantics, simultaneously strengthening global generative guidance and forming more discriminative representations. This simple approach yields consistent dual improvements across pixel-space UNet and UViT models and the latent-space DiT model, with minimal overhead. Crucially, it creates an architectural semantic bridge that propagates discriminative improvements into generation and accommodates further techniques such as contrastive self-distillation. Experiments show that our enhanced models, especially the self-conditioned DiT, are powerful dual learners that yield strong, transferable representations on image classification and dense prediction tasks, surpassing various generative self-supervised models in linear probing while maintaining or improving generation quality.
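The core mechanism described above, aggregating intermediate features into a global semantic code and rerouting it to condition later decoding layers, can be illustrated with a minimal NumPy sketch. All names, shapes, and the specific pooling/injection choices here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def layer(x, w):
    # Stand-in for a network layer: linear map + tanh-based GELU approximation.
    h = x @ w
    return h * 0.5 * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))

rng = np.random.default_rng(0)
d = 16                       # feature width (hypothetical)
x = rng.normal(size=(4, d))  # batch of 4 token features

# Early layers produce intermediate features rich in high-level semantics.
w_enc = [rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3)]
feats, h = [], x
for w in w_enc:
    h = layer(h, w)
    feats.append(h)

# Self-conditioning step 1: aggregate intermediate features into a single
# global semantic vector (mean pooling here is one simple assumed choice).
z = np.mean(np.stack(feats), axis=(0, 1))   # shape (d,)

# Step 2: reroute z to guide every subsequent decoding layer, so high-level
# semantics are injected rather than diluted during decoding. A learned
# projection is emulated by a fixed random matrix for illustration.
w_dec = [rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3)]
w_cond = rng.normal(scale=d**-0.5, size=(d, d))
for w in w_dec:
    h = layer(h + z @ w_cond, w)  # semantic code conditions each decoder layer

print(h.shape)
```

In a real diffusion backbone the injection would typically be a learned modulation (e.g., a shift or scale on layer activations) rather than a plain additive term, but the data flow, pool mid-network features into a bottleneck code, then broadcast it into later layers, is the pattern the abstract describes.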