Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline in which a pretrained autoencoder introduces lossy reconstruction, causing errors to accumulate and hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the autoencoder and learns the diffusion process directly in pixel space. PixelDiT adopts a fully transformer-based architecture built around a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256×256, surpassing existing pixel-space generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at 1024×1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-Bench, approaching the best latent diffusion models.
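To make the dual-level design concrete, the sketch below shows one plausible reading of it in PyTorch: a patch-level transformer runs globally over patch tokens, and a pixel-level transformer then refines the pixels inside each patch, conditioned on that patch's global token. Every name, dimension, and the specific way the two levels interact here are illustrative assumptions, not the paper's actual implementation; timestep and class conditioning are also omitted for brevity.

```python
# Minimal sketch of a dual-level pixel-space DiT, assuming plain PyTorch.
# All module names, sizes, and the patch-to-pixel conditioning scheme are
# illustrative guesses; timestep/class conditioning is omitted for brevity.
import torch
import torch.nn as nn


class PatchLevelDiT(nn.Module):
    """Global transformer over patch tokens (captures semantics)."""
    def __init__(self, patch, dim, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Linear(3 * patch * patch, dim)  # patch pixels -> one token
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                    # (B, N, 3*p*p)
        return self.blocks(self.embed(patches))    # (B, N, dim)


class PixelLevelDiT(nn.Module):
    """Local transformer over pixel tokens within a patch (refines texture)."""
    def __init__(self, dim, pix_dim=64, depth=2, heads=4):
        super().__init__()
        self.pix_embed = nn.Linear(3, pix_dim)     # one token per pixel
        self.cond = nn.Linear(dim, pix_dim)        # inject the patch-level context
        layer = nn.TransformerEncoderLayer(pix_dim, heads, 4 * pix_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(pix_dim, 3)           # per-pixel prediction (e.g. noise)

    def forward(self, pixels, ctx):                # (B*N, p*p, 3), (B*N, dim)
        h = self.pix_embed(pixels) + self.cond(ctx).unsqueeze(1)
        return self.out(self.blocks(h))            # (B*N, p*p, 3)


class DualLevelPixelDiT(nn.Module):
    def __init__(self, img=256, patch=16, dim=384):
        super().__init__()
        self.p, self.n = patch, (img // patch) ** 2
        self.patch_dit = PatchLevelDiT(patch, dim)
        self.pixel_dit = PixelLevelDiT(dim)

    def forward(self, x):                          # x: noisy image, (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.p
        # Group pixels into non-overlapping patches: (B, N, p*p, 3).
        patches = (x.unfold(2, p, p).unfold(3, p, p)   # (B, 3, h, w, p, p)
                    .permute(0, 2, 3, 4, 5, 1)         # (B, h, w, p, p, 3)
                    .reshape(B, self.n, p * p, C))
        ctx = self.patch_dit(patches.flatten(2))       # global semantic pass
        out = self.pixel_dit(patches.reshape(B * self.n, p * p, C),
                             ctx.reshape(B * self.n, -1))  # local texture pass
        # Fold per-pixel predictions back into image layout.
        h = w = H // p
        return (out.reshape(B, h, w, p, p, C)
                   .permute(0, 5, 1, 3, 2, 4)
                   .reshape(B, C, H, W))


if __name__ == "__main__":
    model = DualLevelPixelDiT()
    x = torch.randn(2, 3, 256, 256)
    print(model(x).shape)  # torch.Size([2, 3, 256, 256])
```

One point the abstract's analysis suggests this split is meant to capture: the expensive global attention runs only over N patch tokens rather than all H×W pixels, while the pixel-level pass, operating within each patch, is what makes direct pixel-space training tractable without an autoencoder.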