How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting - literally just filling in a hole in a concatenated visual prompt image - turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked autoencoders on a new dataset that we curated - 88k unlabeled figures sourced from academic papers on Arxiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.
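To make the setup concrete, below is a minimal sketch of how such a concatenated visual prompt might be assembled: an example input-output pair and a query image are arranged into a single 2x2 canvas whose bottom-right cell is left empty as the hole to be inpainted. The `inpainting_model` call at the end is a hypothetical placeholder for a masked autoencoder trained on figure-like data, not the paper's actual API; function and variable names here are illustrative assumptions.

```python
import numpy as np

def make_visual_prompt(example_input, example_output, query_input):
    """Concatenate one example pair and a query into a 2x2 'visual prompt'
    image, leaving the bottom-right cell as a hole to be inpainted.
    All images are assumed to be HxWx3 uint8 arrays of the same size."""
    h, w, _ = example_input.shape
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=np.uint8)
    canvas[:h, :w] = example_input    # top-left: task example input
    canvas[:h, w:] = example_output   # top-right: task example output
    canvas[h:, :w] = query_input      # bottom-left: new query input
    # bottom-right stays empty: this is the hole the model must fill in

    # Binary mask marking the region to be predicted (True = hole).
    mask = np.zeros((2 * h, 2 * w), dtype=bool)
    mask[h:, w:] = True
    return canvas, mask

# Hypothetical usage with a pretrained inpainting model:
# canvas, mask = make_visual_prompt(x_example, y_example, x_query)
# completed = inpainting_model(canvas, mask)                # fills the hole
# h2, w2 = canvas.shape[0] // 2, canvas.shape[1] // 2
# y_query = completed[h2:, w2:]                             # crop the predicted output
```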