Vision transformers (ViT) have shown promise in various vision tasks, including low-level ones, while the U-Net remains dominant in score-based diffusion models. In this paper, we perform a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (like those in the U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular visual datasets, U-ViT achieves generation results competitive with state-of-the-art U-Net models while requiring a comparable, if not smaller, number of parameters and amount of computation.
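The long skip connections described in the abstract can be illustrated with a minimal wiring sketch. It is not the authors' implementation: the function names (`forward_with_long_skips`, `add_merge`), the split into shallow, middle, and deep blocks, and the element-wise sum used to merge the two branches are all assumptions made for illustration. The abstract does not specify how the skip branch is combined with the main branch, so the merge rule is left as a replaceable argument, and the transformer blocks are treated as opaque callables.

```python
from typing import Callable, List, Sequence

Tokens = List[List[float]]               # one embedding vector per token
Block = Callable[[Tokens], Tokens]       # stand-in for a transformer block
Merge = Callable[[Tokens, Tokens], Tokens]


def add_merge(main: Tokens, skip: Tokens) -> Tokens:
    """Hypothetical merge rule (an assumption): element-wise sum of both branches."""
    return [[m + s for m, s in zip(m_tok, s_tok)]
            for m_tok, s_tok in zip(main, skip)]


def forward_with_long_skips(
    tokens: Tokens,
    in_blocks: Sequence[Block],
    mid_block: Block,
    out_blocks: Sequence[Block],
    merge: Merge = add_merge,
) -> Tokens:
    """Run blocks in a U-shaped order with long skip connections.

    The output of the i-th shallow block is fed, through `merge`, into the
    input of the i-th deep block counted from the end, as in a U-Net.
    """
    if len(in_blocks) != len(out_blocks):
        raise ValueError("need the same number of shallow and deep blocks")

    skips: List[Tokens] = []
    h = tokens
    for block in in_blocks:
        h = block(h)
        skips.append(h)                  # remember each shallow activation

    h = mid_block(h)

    for block in out_blocks:
        # Last stored activation pairs with the first deep block (LIFO order).
        h = block(merge(h, skips.pop()))
    return h
```

In a real model each block would be an attention-plus-MLP layer over image-patch tokens, and `merge` could be any learned or fixed combination of the two branches; the sketch only fixes the order in which shallow activations are stored and reused.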