Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated image-sentence pairs. While such models provide a powerful score for matching and for subsequent zero-shot tasks, they cannot generate a caption for a given image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests.
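
To make the idea of arithmetic in a shared image-text embedding space concrete, here is a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are hand-made stand-ins for the outputs of a visual-semantic encoder, the helper names (`arithmetic_query`, `best_caption`) are hypothetical, and the final step picks the closest sentence from a fixed candidate list by cosine similarity rather than generating text with a language model as the abstract describes.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def arithmetic_query(plus, minus=()):
    """Add the unit-length 'plus' embeddings and subtract the 'minus' ones."""
    out = [0.0] * len(plus[0])
    for sign, group in ((1.0, plus), (-1.0, minus)):
        for v in group:
            for i, x in enumerate(normalize(v)):
                out[i] += sign * x
    return normalize(out)


def best_caption(query, candidates):
    """Return the candidate sentence whose embedding is closest to the query."""
    return max(candidates, key=lambda text: cosine(query, candidates[text]))


# Toy embeddings (assumed, hand-made); axes: (royalty, male, female).
image_of_king = [1.0, 1.0, 0.0]   # stands in for an encoded image
text_man = [0.0, 1.0, 0.0]        # stands in for an encoded word
text_woman = [0.0, 0.0, 1.0]

# Stand-in for sentence embeddings; a real system would generate text instead.
candidates = {
    "a queen": [1.0, 0.0, 1.0],
    "a king": [1.0, 1.0, 0.0],
    "a man": [0.0, 1.0, 0.0],
}

# "image of a king" - "man" + "woman"
query = arithmetic_query(plus=[image_of_king, text_woman], minus=[text_man])
print(best_caption(query, candidates))
```

Because images and text live in the same space in this sketch, the inputs to `arithmetic_query` can mix both freely, which mirrors the abstract's claim that the operands may be either images or text while the output is a sentence.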

