Although deep generative models have attracted considerable attention, most existing work targets the unimodal generation task. In this paper, we explore a new method for unconditional image-text pair generation. We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE takes a paired image and text as input and learns a joint quantized representation space, so that the image-text pair can be converted into a sequence of unified indices. An autoregressive generative model can then model the joint image-text representation and even perform unconditional image-text pair generation. Extensive experimental results demonstrate that our approach effectively generates semantically consistent image-text pairs and also enhances the meaningful alignment between image and text.
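The core idea can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual architecture: all names, shapes, and the codebook size are assumptions. Image and text features are each quantized against one shared codebook, and the resulting indices are concatenated into the single unified sequence that an autoregressive model would consume.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared codebook: 512 codes of dimension 64 (sizes assumed).
codebook = rng.normal(size=(512, 64))

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # (n, d) vs (k, d) -> (n, k) squared distances via broadcasting
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Stand-ins for encoder outputs: e.g. a 4x4 grid of image patch features
# and 8 text token features (purely illustrative).
image_feats = rng.normal(size=(16, 64))
text_feats = rng.normal(size=(8, 64))

# One unified index sequence over image and text, ready for an
# autoregressive prior over the joint representation.
unified = np.concatenate([quantize(image_feats, codebook),
                          quantize(text_feats, codebook)])
print(unified.shape)  # a single sequence of 24 indices
```

Because both modalities share one codebook, an autoregressive model over `unified` sees image and text tokens in a common vocabulary, which is what makes joint, unconditional generation of the pair possible.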