In this work we present GLaSS, a novel zero-shot framework for generating an image (or a caption) corresponding to a given caption (or image). GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, produces similar embeddings for the two. Conversely, GLaSS takes a caption (or an image) as input and generates the image (or the caption) whose CLIP embedding is most similar to that of the input. This optimal image (or caption) is produced via a generative network, after an exploration of its latent space by a genetic algorithm. Promising results are shown, based on experiments with the image generators BigGAN and StyleGAN2, and with the text generator GPT2.
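To make the caption-to-image direction concrete, below is a minimal sketch of a CLIP-guided genetic latent-space search over a BigGAN generator. It is not the authors' implementation: the libraries used (`clip` from openai/CLIP and `pytorch_pretrained_biggan`), the fixed class vector, and all hyper-parameters (population size, elite count, mutation scale, truncation) are illustrative assumptions.

```python
# Sketch of CLIP-guided genetic search over a GAN latent space (assumptions noted above).
import torch
import torch.nn.functional as F
import clip
from pytorch_pretrained_biggan import BigGAN, one_hot_from_names

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device)

def clip_embed_images(images):
    # BigGAN outputs pixels in [-1, 1]; CLIP expects 224x224 normalized RGB.
    images = (images + 1) / 2
    images = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    images = (images - CLIP_MEAN.view(1, 3, 1, 1)) / CLIP_STD.view(1, 3, 1, 1)
    return F.normalize(clip_model.encode_image(images).float(), dim=-1)

@torch.no_grad()
def glass_search(caption, pop_size=32, elite=8, generations=50,
                 sigma=0.3, truncation=0.4, class_name="dog"):
    # Target: the CLIP embedding of the input caption.
    text = F.normalize(
        clip_model.encode_text(clip.tokenize([caption]).to(device)).float(), dim=-1)
    # A fixed BigGAN class vector for simplicity (hypothetical choice);
    # the genetic search then runs over the 128-d noise vectors only.
    cls = torch.from_numpy(
        one_hot_from_names([class_name], batch_size=pop_size)).to(device)
    pop = torch.randn(pop_size, 128, device=device) * truncation
    for _ in range(generations):
        imgs = gan(pop, cls, truncation)
        # Fitness: cosine similarity between image and caption embeddings.
        fitness = (clip_embed_images(imgs) @ text.T).squeeze(1)
        parents = pop[fitness.topk(elite).indices]
        # Next generation: keep the elites, fill with mutated copies of them.
        idx = torch.randint(elite, (pop_size - elite,), device=device)
        children = parents[idx] + sigma * torch.randn(
            pop_size - elite, 128, device=device)
        pop = torch.cat([parents, children])
    # pop[0] is the best individual from the last evaluated generation.
    return gan(pop[:1], cls[:1], truncation)

# Example: glass_search("a photo of a golden retriever on the beach")
```

The caption-from-image direction follows the same pattern with the roles swapped: a genetic algorithm explores the input space of a text generator such as GPT2, scoring candidate captions by the CLIP similarity between their embeddings and that of the input image.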