Fine-Grained Regional Prompt Tuning for Visual Abductive Reasoning

Visual Abductive Reasoning (VAR) is an emerging vision-language (VL) topic in which a model must retrieve or generate a likely textual hypothesis from a visual input (an image or part of an image) by reasoning backward from prior knowledge or commonsense. Unlike conventional VL retrieval or captioning tasks, where the entities mentioned in the text appear in the image, the facts relevant to an abductive inference are not directly visible in the input image. Moreover, the inferences are causally tied to regional visual hints and vary with them. Existing works highlight visual parts against a global background using specific prompt tuning techniques (e.g., colorful prompt tuning) on top of foundation models such as CLIP. However, these methods patchify "regional hints" and "global context" uniformly at the same granularity and may lose fine-grained visual details that matter for abductive reasoning. To tackle this, we propose a simple yet effective Regional Prompt Tuning, which encodes "regional visual hints" and "global contexts" separately at fine-grained and coarse-grained levels. Specifically, our model explicitly upsamples and then patchifies local hints to obtain fine-grained regional prompts. These prompts are concatenated with coarse-grained contextual tokens from whole images. We also equip our model with a new Dual-Contrastive Loss that regresses the visual feature simultaneously toward the features of the factual description (a.k.a. clue text) and the plausible hypothesis (abductive inference text) during training. Extensive experiments on the Sherlock dataset demonstrate that our fully fine-tuned RGP/RGPs with Dual-Contrastive Loss significantly outperform previous SOTAs, achieving rank 1 on abductive reasoning leaderboards among all submissions under all metrics (e.g., P@1$_{i \to t}$: RGPs 38.78 vs. CPT-CLIP 33.44; higher is better). We will open-source our code for further research.
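
To make the two ideas in the abstract concrete, the following is a minimal pure-Python sketch, not the released implementation. It illustrates (1) upsampling a regional hint before patchifying it, so that the region yields more tokens than it would at the global granularity, and (2) a loss that pulls the visual feature toward both the clue text and the inference text. All function names (`regional_prompt_tokens`, `dual_contrastive_loss`, and the helpers) are hypothetical. The nearest-neighbour upsampling, the region-first token order, the CLIP-style symmetric InfoNCE form, the equal weighting of the two terms, and the temperature of 0.07 are assumptions made for illustration; the abstract does not specify them.

```python
import math

def crop(img, box):
    """Cut the region box = (x0, y0, x1, y1) out of a 2D grid (list of rows)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in img[y0:y1]]

def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling by an integer factor (an assumed choice)."""
    return [[row[x // factor] for x in range(len(row) * factor)]
            for row in img for _ in range(factor)]

def patchify(img, patch):
    """Split a 2D grid into non-overlapping patch x patch tiles, each flattened."""
    h, w = len(img), len(img[0])
    return [[img[y + dy][x + dx] for dy in range(patch) for dx in range(patch)]
            for y in range(0, h - patch + 1, patch)
            for x in range(0, w - patch + 1, patch)]

def regional_prompt_tokens(img, box, patch, factor):
    """Fine-grained tokens from the upsampled region, then coarse context tokens."""
    fine = patchify(upsample_nearest(crop(img, box), factor), patch)
    coarse = patchify(img, patch)
    return fine + coarse  # concatenation order is an assumption

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [_dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)

def symmetric_contrastive(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE terms (CLIP-style)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def dual_contrastive_loss(vision, clue, inference, temperature=0.07):
    """Sum of vision-clue and vision-inference terms (equal weights assumed)."""
    return (symmetric_contrastive(vision, clue, temperature)
            + symmetric_contrastive(vision, inference, temperature))

# Usage: a toy 8x8 "image" whose top-left 4x4 region is the visual hint.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
tokens = regional_prompt_tokens(image, box=(0, 0, 4, 4), patch=4, factor=2)
print(len(tokens))  # 4 fine-grained region tokens + 4 coarse context tokens

vision = [[1.0, 0.0], [0.0, 1.0]]
print(dual_contrastive_loss(vision, clue=vision, inference=vision))
```

In this toy setting the 4x4 region would occupy a single 4x4 patch at the global granularity, whereas upsampling it by a factor of 2 first yields four patches. That is the sense in which the regional prompt is finer-grained than the context tokens. A real model would operate on image tensors and feed the tokens to a vision transformer such as CLIP's image encoder.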
