Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M-parameter model at 400K iterations without classifier-free guidance, comparable to results from 685M-parameter models trained for significantly longer. To our knowledge, this is a state-of-the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.