What Makes Good Examples for Visual In-Context Learning?

Large-scale models trained on broad data have recently become the mainstream architecture in computer vision due to their strong generalization performance. This paper focuses on an emergent ability of large vision models known as in-context learning, which allows inference on unseen tasks by conditioning on in-context examples (a.k.a. prompts) without updating the model parameters. The concept is well known in natural language processing but has only very recently been studied for large vision models. We provide the first comprehensive investigation of the impact of in-context examples in computer vision, and find that performance is highly sensitive to the choice of in-context examples. To overcome this problem, we propose a prompt retrieval framework that automates the selection of in-context examples. Specifically, we present (1) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (2) a supervised prompt retrieval method, which trains a neural network to choose examples that directly maximize in-context learning performance. The results demonstrate that our methods bring non-trivial improvements to visual in-context learning compared with the commonly used random selection.
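The unsupervised retrieval idea can be illustrated with a minimal sketch. It assumes that each image has already been mapped to a feature vector by some off-the-shelf model, and it uses cosine similarity as the nearest-example criterion; the abstract does not specify either choice. The names `cosine_similarity` and `retrieve_prompt` and the toy vectors are hypothetical, introduced only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompt(
    query_feat: Sequence[float],
    candidate_feats: Sequence[Sequence[float]],
) -> int:
    """Return the index of the candidate example most similar to the query.

    Assumption: features come from a frozen, off-the-shelf encoder that is
    not shown here; similarity is cosine, which is an illustrative choice.
    """
    if not candidate_feats:
        raise ValueError("candidate_feats must contain at least one example")
    return max(
        range(len(candidate_feats)),
        key=lambda i: cosine_similarity(query_feat, candidate_feats[i]),
    )


# Toy 2-D features standing in for real image embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best_index = retrieve_prompt([0.9, 0.1], candidates)
```

In this sketch, the selected candidate would replace a randomly drawn in-context example. The supervised variant described in the abstract would instead learn the feature extractor so that the retrieved examples maximize in-context learning performance.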
