Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization on questions that have rare answers. In addition, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, which achieves performance similar to separately optimized single-task models. Our code will be publicly available at: https://github.com/j-min/VL-T5
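To make the unified text-to-text formulation concrete, below is a minimal sketch (not the authors' code) of how different vision-and-language tasks can be cast as the same (visual input, source text, target text) format consumed by a single encoder-decoder model. The task prefixes, the `<vis_k>` region token, and the `visual_features` placeholder are illustrative assumptions, not the exact format used in the released implementation.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    visual_features: List[List[float]]  # e.g., pooled region features from an object detector
    source_text: str                    # task prefix + textual input
    target_text: str                    # label expressed as text


def vqa_example(features, question, answer) -> Example:
    # Visual question answering: the answer label becomes the generation target
    # instead of a class index for a multi-label classifier.
    return Example(features, f"vqa: {question}", answer)


def refexp_example(features, expression, region_id) -> Example:
    # Referring expression comprehension: the matching region is named by a
    # (hypothetical) region token rather than scored by a separate region head.
    return Example(features, f"visual grounding: {expression}", f"<vis_{region_id}>")


def caption_example(features, caption) -> Example:
    # Image captioning: already a natural text-generation task.
    return Example(features, "caption:", caption)


if __name__ == "__main__":
    feats = [[0.0] * 2048]  # placeholder for detector region features
    batch = [
        vqa_example(feats, "What is the cat sitting on?", "sofa"),
        refexp_example(feats, "the man in the red shirt", region_id=3),
        caption_example(feats, "A cat sleeping on a gray sofa."),
    ]
    for ex in batch:
        print(ex.source_text, "->", ex.target_text)
```

Because every task shares this format, a single model and a single maximum-likelihood text-generation objective can be reused across tasks, which is what enables the multi-task, single-parameter-set training described above.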