MAGVLT: Masked Generative Vision-and-Language Transformer

Generative modeling of multimodal image-text data has advanced rapidly with large-scale paired datasets, but there have been few attempts to generate both images and text with a single model, as opposed to generating one fixed modality conditioned on the other. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Specifically, we propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). Compared to ARGVLT, MAGVLT enables bidirectional context encoding, fast decoding through parallel token prediction with iterative refinement, and extended editing capabilities such as image and text infilling. To train MAGVLT rigorously from scratch on image-text pairs, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on step-unrolled mask prediction and selective prediction on a mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that MAGVLT outperforms ARGVLT by a large margin while also providing a significant inference speedup. In particular, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation on MS-COCO with a single moderate-sized model (fewer than 500M parameters), even without using monomodal data or networks.
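To make the idea of parallel token prediction with iterative refinement concrete, the following is a minimal sketch of a generic mask-predict decoding loop. It is not the MAGVLT implementation: the cosine unmasking schedule, the `MASK` sentinel, the function names, and the random `toy_predictor` that stands in for the transformer are all assumptions made for illustration only.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a masked position (assumption)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the transformer (assumption): returns a random
    (token_id, confidence) pair for every position in the sequence."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_mask_decode(length, steps, vocab_size,
                          predictor=toy_predictor, seed=0):
    """Start fully masked, predict all positions in parallel at each step,
    commit the most confident predictions, and leave the rest masked."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predictor(tokens, vocab_size, rng)
        # Cosine schedule (assumption): how many positions remain masked
        # after this step. It reaches zero at the final step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        # Always commit at least one position per step.
        keep_masked = max(0, min(keep_masked, len(masked) - 1))
        n_commit = len(masked) - keep_masked
        # Commit the masked positions with the highest confidence.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_mask_decode(length=16, steps=4, vocab_size=100))
```

The number of model calls equals `steps` rather than the sequence length, which is the source of the speedup over token-by-token autoregressive decoding. Infilling fits the same loop if only the region to be edited is initialized to `MASK`.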
