Multi-modal generation has been widely explored in recent years. Current research directions involve generating text from an image or vice versa. In this paper, we propose a new task called CIGLI: Conditional Image Generation from Language and Image. Instead of generating an image from text alone, as in text-to-image generation, this task requires generating an image from both a textual description and an image prompt. We designed a new dataset to ensure that the text description draws on information from both images, so that analyzing the description alone is insufficient to generate an image. We then propose a novel language-image fusion model that improves performance over two established baseline methods, as evaluated by quantitative (automatic) and qualitative (human) evaluations. The code and dataset are available at https://github.com/vincentlux/CIGLI.