Bootstrapping from pre-trained language models has proven to be an efficient approach for building foundation vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, it is difficult, if not impossible, to use this approach to make the model conform with a user's rationales for specific answers. To elicit and reinforce commonsense reasons, we propose an iterative sampling and tuning paradigm, called ILLUME, that executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
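For intuition, the sampling-feedback-tuning loop described above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the `vlm` interface (`sample_rationales`, `fine_tune`), the `critic` callback, and the `Example` container are hypothetical placeholders assumed for the sketch.

```python
# Minimal sketch of the ILLUME loop: sample candidate rationales, collect
# minimal human preference feedback, and fine-tune on the accepted rationales.
# All interfaces here (vlm, critic, Example) are hypothetical placeholders.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Example:
    image: object                                         # image input
    question: str
    answer: str
    rationales: List[str] = field(default_factory=list)   # human-accepted rationales

def illume_loop(
    vlm,                                                  # hypothetical VLM wrapper
    dataset: List[Example],
    critic: Callable[[Example, List[str]], List[str]],    # human preference selection
    num_iterations: int = 5,
    samples_per_prompt: int = 8,
):
    """Iteratively sample rationales, gather preference feedback, and fine-tune."""
    training_data: List[Example] = []
    for _ in range(num_iterations):
        for ex in dataset:
            # 1. Sample multiple candidate rationales for the image-question-answer prompt.
            candidates = vlm.sample_rationales(
                ex.image, ex.question, ex.answer, n=samples_per_prompt
            )
            # 2. The human critic keeps only rationales matching their reasoning
            #    (minimal feedback via preference selection).
            accepted = critic(ex, candidates)
            if accepted:
                ex.rationales = accepted
                training_data.append(ex)
        # 3. Fine-tune the VLM on the growing pool of accepted rationales.
        vlm.fine_tune(training_data)
    return vlm, training_data
```

The loop only adds examples for which the critic accepts at least one sampled rationale, which is how the training set grows from minimal feedback rather than from fully supervised rationale annotations.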