In this work we present CLIP-GLaSS, a novel zero-shot framework for generating an image (or a caption) corresponding to a given caption (or image). CLIP-GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, produces similar embeddings. Conversely, CLIP-GLaSS takes a caption (or an image) as input and generates the image (or the caption) whose CLIP embedding is most similar to that of the input. This optimal image (or caption) is produced via a generative network, after an exploration by a genetic algorithm. Promising results are shown, based on experiments with the image generators BigGAN and StyleGAN2 and the text generator GPT2.
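To make the abstract's pipeline concrete, the following is a minimal sketch (not the authors' code) of the caption-to-image direction: a toy genetic algorithm searches a generator's latent space for the image whose CLIP embedding best matches the caption's. The `generate` callable is a hypothetical placeholder for any latent-to-image generator (e.g. BigGAN or StyleGAN2), assumed to output batches already preprocessed as CLIP expects (224x224, CLIP-normalized); the population size, mutation scale, and elitist selection scheme are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of CLIP-guided latent search with a toy genetic algorithm.
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def fitness(population, text_features, generate):
    """Cosine similarity between CLIP embeddings of generated images and the caption."""
    with torch.no_grad():
        images = generate(population)  # hypothetical generator: (N, latent) -> (N, 3, 224, 224)
        image_features = model.encode_image(images)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        return (image_features @ text_features.T).squeeze(-1)  # one score per candidate

def evolve(caption, generate, latent_dim=128, pop_size=32,
           generations=200, elite=4, sigma=0.1):
    """Evolve latent vectors toward the caption: keep elites, mutate with Gaussian noise."""
    with torch.no_grad():
        text = clip.tokenize([caption]).to(device)
        text_features = model.encode_text(text)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    population = torch.randn(pop_size, latent_dim, device=device)
    for _ in range(generations):
        scores = fitness(population, text_features, generate)
        best = population[scores.topk(elite).indices]       # elitist selection
        children = best.repeat(pop_size // elite, 1)        # clone elites
        children = children + sigma * torch.randn_like(children)  # mutate
        population = children
    return best[0]  # latent whose image best matched the caption
```

The caption-to-image and image-to-caption directions are symmetric: for captioning, the genetic algorithm would instead explore the context of a text generator such as GPT2, scoring candidates with `encode_text` against the input image's embedding.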