In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
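To make the instruction-based sequence-to-sequence unification concrete, below is a minimal, hedged sketch of how heterogeneous tasks can be cast into a single instruction-prefixed source/target text format, so that one seq2seq model handles all of them without task-specific heads. The instruction wordings, field names, and the `build_example` helper are illustrative assumptions, not the official OFA data pipeline.

```python
# Illustrative sketch (not the official OFA pipeline): every task is mapped to
# the same (instruction + input, target) text-pair format. A single shared
# sequence-to-sequence model is then trained jointly on all such pairs.
# The instruction strings and field names are hypothetical placeholders.

def build_example(task: str, inputs: dict) -> tuple:
    """Map a task instance to an instruction-prefixed source/target text pair."""
    if task == "caption":
        # Image captioning: the image is fed to the encoder alongside the prompt.
        return ("what does the image describe?", inputs["caption"])
    if task == "vqa":
        # Visual question answering: the question itself serves as the instruction.
        return (inputs["question"], inputs["answer"])
    if task == "grounding":
        # Visual grounding: the target is a short sequence of location tokens.
        return (f'which region does the text "{inputs["text"]}" describe?',
                inputs["region_tokens"])
    if task == "classification":
        # Image classification: the label name is generated as plain text.
        return ("what does the image describe?", inputs["label_name"])
    raise ValueError(f"unknown task: {task}")


# Example: two different tasks expressed through one interface.
print(build_example("caption", {"caption": "two dogs playing on the grass"}))
print(build_example("vqa", {"question": "what color is the car?", "answer": "red"}))
```

Because every task reduces to the same text-in/text-out interface, finetuning on a new downstream task only changes the instruction and target strings, which is what lets OFA avoid extra task-specific layers.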