In-context learning, as a new paradigm in NLP, allows a model to rapidly adapt to various tasks with only a handful of prompts and examples. In computer vision, however, the difficulty of in-context learning lies in the fact that tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that a vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution: we redefine the outputs of core vision tasks as images, and specify task prompts as images as well. With this idea, our training process is extremely simple: we perform standard masked image modeling on stitched pairs of input and output images. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter achieves competitive performance compared to well-established task-specific models on seven representative vision tasks, ranging from high-level visual understanding to low-level image processing, and significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows the capability of completing out-of-domain tasks that do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.
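The stitching idea above can be sketched as follows. This is an illustrative assumption of the canvas layout, not the paper's exact implementation: the helper `build_incontext_canvas` arranges a task-prompt pair with a query image in a 2x2 grid and marks the query's output quadrant as masked, which is the region a masked-image-modeling model would be asked to inpaint.

```python
import numpy as np

def build_incontext_canvas(prompt_in, prompt_out, query_in, patch=16):
    """Stitch a task-prompt pair with a query image into one canvas.

    Layout (2x2 grid of equally sized H x W x C images):
        [ prompt_in | prompt_out ]
        [ query_in  |  (masked)  ]
    The bottom-right quadrant is zeroed out; predicting its masked
    patches from the visible ones yields the query's output image.
    """
    h, w, c = query_in.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=query_in.dtype)
    canvas[:h, :w] = prompt_in
    canvas[:h, w:] = prompt_out
    canvas[h:, :w] = query_in
    # Boolean mask over patch positions: True = patch to be predicted.
    mask = np.zeros((2 * h // patch, 2 * w // patch), dtype=bool)
    mask[h // patch:, w // patch:] = True
    return canvas, mask
```

At inference time, swapping in a prompt pair from a different task changes which output the model reconstructs, without any change to the model itself.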