Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, the outputs of these models rarely align with users' rationales for specific answers. To improve this alignment and reinforce commonsense reasoning, we propose a tuning paradigm based on human interactions with machine-generated data. Our ILLUME executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out rationalization capabilities in the VLM that are aligned with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
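To make the loop concrete, the following Python sketch illustrates one plausible realization of the sample-select-finetune cycle described above. The function names (generate_rationales, collect_preference, finetune) and the list-based dataset are hypothetical placeholders for illustration, not the authors' implementation.

```python
# Minimal sketch of an ILLUME-style loop, assuming `vlm` is a callable model
# wrapper and `dataset` is an iterable of (image, question, answer) triples.

def generate_rationales(vlm, image, question, answer, k=5):
    """Sample k candidate rationales from the VLM for one image-question-answer prompt."""
    return [vlm(image, question, answer, seed=i) for i in range(k)]

def collect_preference(candidates):
    """Stand-in for the human critic: select the acceptable rationales (minimal feedback)."""
    # Placeholder heuristic; in practice this is an interactive preference selection.
    return [c for c in candidates if "because" in c]

def finetune(vlm, selected_examples):
    """Placeholder for supervised fine-tuning on the critic-approved rationales."""
    return vlm  # a real implementation would update the model weights here

def illume_loop(vlm, dataset, rounds=3):
    """Iteratively grow the training set from self-generated, human-filtered rationales."""
    training_data = []
    for _ in range(rounds):
        for image, question, answer in dataset:
            candidates = generate_rationales(vlm, image, question, answer)
            approved = collect_preference(candidates)
            training_data.extend((image, question, answer, r) for r in approved)
        vlm = finetune(vlm, training_data)
    return vlm, training_data
```

In practice, collect_preference would be an annotation interface where the critic marks acceptable rationales, and finetune would run a standard supervised update on the approved (image, question, answer, rationale) tuples; the sketch only fixes the control flow of the loop.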