A comprehensive understanding of vision and language, and of their interrelation, is crucial for uncovering the underlying similarities and differences between these modalities and for learning more generalized, meaningful representations. In recent years, most work on text-to-image synthesis and image-to-text generation has focused on supervised generative deep architectures, with little attention paid to learning the similarities between the embedding spaces of the two modalities. In this paper, we propose a novel self-supervised deep-learning approach to learning cross-modal embedding spaces for both image-to-text and text-to-image generation. In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder and dense sentence-level vector representations using an LSTM-based text autoencoder; we then learn the mapping from the embedding space of one modality to that of the other using GAN- and maximum-mean-discrepancy-based generative networks. We demonstrate, both qualitatively and quantitatively, that our model learns to generate textual descriptions from images as well as images from textual descriptions.
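To make the cross-modal mapping idea concrete, the following is a minimal illustrative sketch (not the authors' released code): a small generator network maps image embeddings into the text-embedding space and is trained with a maximum mean discrepancy (MMD) loss between mapped and real text embeddings. The embedding dimensionality, network shape, kernel bandwidth, and all names (EMB_DIM, CrossModalMapper, gaussian_mmd) are assumptions made for illustration only.

```python
# Illustrative sketch only; all names and hyperparameters are assumptions.
import torch
import torch.nn as nn

EMB_DIM = 256  # assumed common dimensionality of both embedding spaces


class CrossModalMapper(nn.Module):
    """Maps embeddings from one modality's space into the other's."""

    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between two batches using a Gaussian RBF kernel."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() - 2 * kernel(x, y).mean() + kernel(y, y).mean()


# Usage sketch: img_z and txt_z stand in for batches of precomputed codes
# from the (frozen) image and text autoencoders, respectively.
mapper = CrossModalMapper()
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)

img_z = torch.randn(32, EMB_DIM)  # placeholder image-autoencoder embeddings
txt_z = torch.randn(32, EMB_DIM)  # placeholder text-autoencoder embeddings

loss = gaussian_mmd(mapper(img_z), txt_z)  # align mapped image codes with text codes
opt.zero_grad()
loss.backward()
opt.step()
```

In the full approach described above, such an MMD objective would be combined with an adversarial (GAN) objective, and a symmetric mapper would be trained for the text-to-image direction; this sketch shows only the distribution-matching component.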