A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationship between images and text. For example, captioning models attempt to understand the semantics of an image and then transform them into text. An important question is: which caption best reflects a deep understanding of the image content? Conversely, given a text, which image best presents its semantics? In this work, we argue that the best caption for a given image is the text that, when used to generate an image, yields the image most similar to the original. Likewise, the best image for a given text is the one whose generated caption aligns best with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
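For illustration only, the following is a minimal sketch of the round-trip selection criterion described above (the caption whose generated image is most similar to the original image is preferred). The `text_to_image` and `image_encoder` callables and the cosine-similarity scoring are assumptions introduced here for the example; they are not specified by the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_best_caption(image, candidate_captions, text_to_image, image_encoder):
    """Rank candidate captions by round-trip consistency.

    Each caption is passed to a text-to-image generator; the caption whose
    generated image lies closest to the original image in the encoder's
    feature space is returned together with its score.
    """
    original_feat = image_encoder(image)
    scored = []
    for caption in candidate_captions:
        generated = text_to_image(caption)  # caption -> synthetic image
        score = cosine_similarity(image_encoder(generated), original_feat)
        scored.append((score, caption))
    return max(scored)  # (best_score, best_caption)
```

The symmetric direction (selecting the best image for a given text by captioning each candidate image and comparing the caption to the original text) follows the same pattern with the roles of the two generative models exchanged.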