Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like to create general purpose vision systems that can learn and perform a range of tasks without any modification to the architecture or learning process. In this paper, we propose GPV-1, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more. We also propose evaluations of generality of architecture, skill-concept transfer, and learning efficiency that may inform future work on general purpose vision. Our experiments indicate GPV-1 is effective at multiple tasks, reuses some concept knowledge across tasks, can perform the Referring Expressions task zero-shot, and further improves upon the zero-shot performance using a few training samples.
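To make the task-agnostic input/output contract concrete, the sketch below shows one way such an interface could look. This is a minimal illustration only: the class and method names (`GeneralPurposeVisionModel`, `predict`, `GPVOutput`) are hypothetical and are not the paper's actual API; the only assumption taken from the abstract is that every task is posed as an image plus a natural-language prompt and answered with text and/or bounding boxes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GPVOutput:
    """Hypothetical unified output container: free-form text and/or bounding boxes."""
    text: Optional[str] = None
    # Boxes as (x1, y1, x2, y2) in pixel coordinates, one relevance score per box.
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)


class GeneralPurposeVisionModel:
    """Hypothetical task-agnostic interface: every task is (image, prompt) -> GPVOutput."""

    def predict(self, image, prompt: str) -> GPVOutput:
        # A real model would jointly encode the image and prompt, then decode
        # text tokens and/or box proposals; this stub only fixes the contract.
        raise NotImplementedError


# The same call signature serves all tasks; only the prompt changes:
#   model.predict(img, "What object is this?")         -> classification (text)
#   model.predict(img, "Locate the dog.")               -> localization (boxes)
#   model.predict(img, "How many people are sitting?")  -> VQA (text)
#   model.predict(img, "Describe this image.")          -> captioning (text)
```

Under such a contract, adding a new task requires only new prompts and training examples, not a new output head or loss, which is the generality the abstract argues for.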