Although deep generative models have attracted considerable attention, most existing work is designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design the Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn multimodal semantic correlations in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence; we then perform unconditional image-text pair generation with this code sequence. Extensive experiments on synthetic and real-world datasets show the correlation between the quantized joint space and multimodal generation capability. In addition, we demonstrate the superiority of our approach over several baselines in both aspects. The source code is publicly available at: https://github.com/ttumyche/MXQ-VAE.
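To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea: a masked image-text pair is encoded by a Transformer encoder into a joint representation, which a shared codebook then vector-quantizes into a unified code sequence. All module names, dimensions, the patch size, and the nearest-neighbor quantizer are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
# Hypothetical sketch of the MXQ-VAE encoding path described in the abstract.
# Module names, sizes, and the quantizer details are assumptions for illustration.
import torch
import torch.nn as nn


class JointQuantizer(nn.Module):
    """Nearest-neighbor vector quantizer over a shared image-text codebook."""

    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                      # z: (batch, seq_len, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch * seq_len, dim)
        dist = torch.cdist(flat, self.codebook.weight)   # L2 distances to codes
        codes = dist.argmin(dim=-1)            # unified code sequence (flattened)
        z_q = self.codebook(codes).view_as(z)  # quantized joint representation
        z_q = z + (z_q - z).detach()           # straight-through gradient estimator
        return z_q, codes.view(z.shape[0], z.shape[1])


class MXQVAESketch(nn.Module):
    """Encodes a masked image-text pair into one quantized code sequence."""

    def __init__(self, vocab_size=30522, image_patches=64, text_len=32, dim=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.image_proj = nn.Linear(3 * 16 * 16, dim)   # flattened 16x16 RGB patches
        self.pos_embed = nn.Parameter(torch.zeros(1, image_patches + text_len, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.quantizer = JointQuantizer(dim=dim)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, P, 3*16*16) with some patches masked (e.g. zeroed out)
        # text_ids:      (B, T) with some tokens replaced by a [MASK] id
        tokens = torch.cat([self.image_proj(image_patches),
                            self.text_embed(text_ids)], dim=1)
        tokens = tokens + self.pos_embed[:, :tokens.size(1)]
        joint = self.encoder(tokens)            # joint image-text representation
        z_q, codes = self.quantizer(joint)      # unified code sequence for generation
        return z_q, codes
```

Under these assumptions, a decoder would reconstruct the image and text from `z_q`, and the returned `codes` would serve as the unified sequence on which an autoregressive model is trained for unconditional image-text pair generation.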