This paper presents NÜWA, a unified multimodal pre-trained model that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video simultaneously for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared with several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is available at https://github.com/microsoft/NUWA.
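To make the 3DNA idea concrete, here is a minimal sketch of nearby attention over tokens arranged on a (T, H, W) grid, where each query attends only to keys inside a local 3D window. The function names (`nearby_mask`, `nearby_attention`), the window extents, and the dense-mask formulation are illustrative assumptions, not the paper's implementation: an efficient realization would gather only the nearby keys, which is what yields the complexity reduction the abstract refers to.

```python
# Illustrative sketch of 3D Nearby Attention (3DNA): each query token at
# grid position (t, h, w) attends only to keys within a local 3D window.
# The dense (N x N) mask below is chosen for clarity; a practical
# implementation would gather the nearby keys per query instead.
import torch
import torch.nn.functional as F


def nearby_mask(T, H, W, et=1, eh=3, ew=3):
    """Boolean mask of shape (N, N), N = T*H*W; True where key j lies
    within the (2*et+1, 2*eh+1, 2*ew+1) neighborhood of query i."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W),
        indexing="ij"), dim=-1).reshape(-1, 3)            # (N, 3)
    dist = (coords[:, None, :] - coords[None, :, :]).abs()  # (N, N, 3)
    extent = torch.tensor([et, eh, ew])
    return (dist <= extent).all(dim=-1)                   # (N, N)


def nearby_attention(q, k, v, T, H, W, et=1, eh=3, ew=3):
    """q, k, v: (batch, N, dim) with N = T*H*W tokens on a 3D grid."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bnd,bmd->bnm", q, k) * scale
    mask = nearby_mask(T, H, W, et, eh, ew).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("bnm,bmd->bnd", attn, v)


# Toy usage: a 4x8x8 token grid with 64-dim embeddings. Setting T=1
# recovers 2D (image) nearby attention; T=H=1 recovers the 1D case.
if __name__ == "__main__":
    B, T, H, W, D = 2, 4, 8, 8, 64
    q = torch.randn(B, T * H * W, D)
    out = nearby_attention(q, q, q, T, H, W)
    print(out.shape)  # torch.Size([2, 256, 64])
```

As the usage comment notes, the same windowed pattern degenerates naturally to 2D and 1D grids, which is one way the uniform treatment of text, image, and video data described above can be read.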