Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
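The sketch below is not the authors' implementation (see the VL-T5 repository above for that); it is a minimal illustration of the unified objective described here, assuming a plain HuggingFace T5 as a stand-in encoder-decoder. Every task is phrased as text-to-text generation and trained with the same language modeling (cross-entropy) loss; the actual model additionally conditions the encoder on visual region embeddings, which are omitted here. The task prefixes and the `<vis_3>` region token are illustrative placeholders, not the paper's exact vocabulary.

```python
# Minimal sketch of the unified text-generation objective (text side only).
# Assumption: a standard T5 stands in for the multimodal encoder-decoder; the real
# model also prepends visual region features to the encoder input.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Different tasks share one input/output format: a task prefix plus the text input,
# with the label expressed as text (an answer word, a region id, a caption, ...).
examples = [
    ("vqa: what color is the cat?", "black"),                   # answer classification as text
    ("visual grounding: the man in the red shirt", "<vis_3>"),  # region id as a text token (hypothetical)
    ("caption:", "a man in a red shirt pets a black cat"),      # image captioning
]

enc = tokenizer([src for src, _ in examples], return_tensors="pt", padding=True)
labels = tokenizer([tgt for _, tgt in examples], return_tensors="pt", padding=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

# One architecture, one objective: maximize the likelihood of the target text.
loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss
loss.backward()
```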