Recent advances in deep learning, such as powerful generative models and joint text-image embeddings, have provided the computational creativity community with new tools, opening new perspectives for artistic pursuits. Text-to-image synthesis approaches, which generate images from text prompts, are a case in point. These images are produced from a latent vector that is progressively refined to agree with the text prompts. To do so, patches are sampled within the generated image and compared with the text prompts in the common text-image embedding space; the latent vector is then updated, using gradient descent, to reduce the mean distance between these patches and the prompts. While this approach gives artists ample freedom to customize the overall appearance of images through their choice of generative model, the reliance on a simple criterion (the mean of distances) often causes mode collapse: the entire image is drawn to the average of all text prompts, thereby losing their diversity. To address this issue, we propose using matching techniques from the optimal transport (OT) literature, yielding images that faithfully reflect a wide diversity of prompts. We provide numerous illustrations showing that OT avoids some of the pitfalls of averaging distances, and demonstrate, both qualitatively and quantitatively, that our proposed method performs better in experiments.
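To make the contrast concrete, below is a minimal sketch (not the authors' code) of the two objectives described above. It assumes hypothetical tensors `patch_emb` (n sampled image patches) and `text_emb` (m prompts), both L2-normalized embeddings in a shared text-image space; the entropic regularization `eps` and iteration count `n_iters` are illustrative choices for a standard Sinkhorn solver, not values from the paper.

```python
import torch

def mean_distance_loss(patch_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Baseline criterion: average cosine distance over all patch/prompt
    # pairs. Every patch is pulled toward the mean of all prompts, which
    # can collapse the image onto a single "average" concept.
    cost = 1.0 - patch_emb @ text_emb.T          # (n, m) cosine distances
    return cost.mean()

def sinkhorn_ot_loss(patch_emb: torch.Tensor, text_emb: torch.Tensor,
                     eps: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    # OT alternative: an entropy-regularized transport plan (computed via
    # Sinkhorn iterations) matches patches to prompts, so each prompt must
    # be covered by some mass of patches instead of being averaged away.
    n, m = patch_emb.shape[0], text_emb.shape[0]
    cost = 1.0 - patch_emb @ text_emb.T          # (n, m) cost matrix
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=cost.device)  # uniform patch mass
    b = torch.full((m,), 1.0 / m, device=cost.device)  # uniform prompt mass
    u = torch.ones_like(a)
    for _ in range(n_iters):                     # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]           # transport plan, shape (n, m)
    return (plan * cost).sum()                   # transport cost <plan, cost>
```

In a generation loop, either loss would be differentiated with respect to the latent vector (the loss is a differentiable function of the patch embeddings, which depend on the latent through the generator), and the latent updated by gradient descent.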