Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it has not yet been explored for vision and multimodal tasks. In this work, we introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset, which consists of 47 diverse multimodal tasks covering 11 broad categories. Each task is designed with at least 5,000 instances (input-output pairs) drawn from existing open-source datasets and 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to improve its performance we explore multiple transfer learning strategies that leverage the large-scale Natural Instructions dataset. Experimental results demonstrate strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from text-only instructions. We also design a new evaluation metric, Sensitivity, to evaluate how sensitive the model is to the variety of instructions. Our results indicate that the model becomes less sensitive to varying instructions after fine-tuning on a diverse set of tasks with multiple instructions for each task.
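The abstract does not spell out how Sensitivity is computed; as an illustration only, the sketch below assumes a simple coefficient-of-variation formulation (standard deviation of per-instruction scores divided by their mean, averaged over tasks), where a lower value indicates the model's predictions change less when the instruction wording changes. The function name and task names are hypothetical.

```python
# Hypothetical sketch of a "Sensitivity"-style metric: how much a model's
# performance varies across different instruction templates for the same task.
# Assumption: coefficient of variation (std / mean) averaged over tasks.
import statistics


def sensitivity(per_instruction_scores: dict[str, list[float]]) -> float:
    """Map each task name to the scores obtained under its different
    instruction templates, and return the average coefficient of variation."""
    per_task = []
    for task, scores in per_instruction_scores.items():
        mean = statistics.mean(scores)
        std = statistics.pstdev(scores)
        per_task.append(std / mean if mean else 0.0)
    # Lower values mean the model is less sensitive to instruction wording.
    return statistics.mean(per_task)


# Example: two tasks, each evaluated with three instruction variants.
scores = {
    "visual_qa": [0.61, 0.58, 0.60],
    "grounded_captioning": [0.42, 0.45, 0.40],
}
print(f"Sensitivity: {sensitivity(scores):.3f}")
```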