While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite this diversity, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all of these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as a task description, and the sequence output adapts to the prompt so that it produces task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
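To make the shared pixel-to-sequence interface concrete, the sketch below (our own illustration, not the paper's released code) shows how a single detection target and a task prompt might be serialized into one discrete-token vocabulary. The bin count, token offsets, prompt ids, and helper names such as `box_to_tokens` are all assumptions introduced for illustration.

```python
# A minimal sketch of the shared pixel-to-sequence idea: task outputs are
# quantized into discrete tokens drawn from one vocabulary, and a short
# task-prompt token tells the model which output format to emit.
# NUM_BINS, the offsets, PROMPTS, and box_to_tokens are illustrative
# assumptions, not values from the paper.

NUM_BINS = 1000                      # coordinate quantization bins (assumed)
COORD_OFFSET = 0                     # token ids [0, NUM_BINS) encode coordinates
CLASS_OFFSET = NUM_BINS             # class-label tokens follow coordinate tokens
PROMPTS = {                          # hypothetical task-prompt token ids
    "detect": CLASS_OFFSET + 100,
    "segment": CLASS_OFFSET + 101,
    "keypoints": CLASS_OFFSET + 102,
    "caption": CLASS_OFFSET + 103,
}

def quantize(coord: float) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete token id."""
    return COORD_OFFSET + min(int(coord * NUM_BINS), NUM_BINS - 1)

def box_to_tokens(task: str, box, class_id: int) -> list[int]:
    """Serialize one target as [prompt, ymin, xmin, ymax, xmax, class]."""
    return [PROMPTS[task]] + [quantize(c) for c in box] + [CLASS_OFFSET + class_id]

# Example: a box covering the upper-left image quadrant, class 3.
tokens = box_to_tokens("detect", (0.0, 0.0, 0.5, 0.5), class_id=3)
print(tokens)  # [1100, 0, 0, 500, 500, 1003]
```

Under this scheme the same decoder and softmax cross-entropy loss serve every task; only the prompt and the semantics assigned to the emitted tokens change, which is the sense in which the four tasks share one architecture and loss.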