Visual Commonsense Reasoning (VCR), regarded as a challenging extension of Visual Question Answering (VQA), pursues higher-level visual comprehension. It comprises two indispensable processes: question answering over a given image and rationale inference that explains the answer. Over the years, a variety of methods tackling VCR have advanced performance on the benchmark dataset. Significant as these methods are, they often treat the two processes separately and hence decompose VCR into two independent VQA instances. As a result, the pivotal connection between question answering and rationale inference is severed, rendering existing efforts less faithful in visual reasoning. To study this issue empirically, we conduct in-depth explorations of both language shortcuts and generalization capability to verify the pitfalls of this treatment. Based on our findings, in this paper we present a plug-and-play knowledge-distillation-enhanced framework that couples the question answering and rationale inference processes. The key contribution is the introduction of a novel branch that serves as a bridge connecting the two processes. Since our framework is model-agnostic, we apply it to popular existing baselines and validate its effectiveness on the benchmark dataset. As the experimental results show, when equipped with our framework, these baselines achieve consistent and significant performance improvements, demonstrating the viability of coupling the two processes as well as the superiority of the proposed framework.
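To make the coupling idea concrete, the sketch below illustrates generic knowledge distillation between two answer-prediction branches. It is a minimal, hypothetical example and not the paper's actual architecture: the function name `distillation_loss`, the temperature and mixing weight, and the notion of a rationale-aware "bridge" branch acting as a teacher for the plain question-answering branch are all illustrative assumptions.

```python
# Hypothetical sketch: couple two VCR processes via knowledge distillation.
# A "teacher" branch that sees question + rationale guides a "student"
# question-answering branch through a soft-label KL-divergence term.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Mix hard-label cross-entropy with soft-label distillation (assumed form)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Standard temperature-scaled KD term (Hinton et al. style).
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce


# Toy usage: VCR offers 4 answer choices per question; batch of 8 examples.
student_logits = torch.randn(8, 4, requires_grad=True)  # Q -> A branch
teacher_logits = torch.randn(8, 4)                      # (Q, R) -> A bridge branch
labels = torch.randint(0, 4, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In such a setup, gradients from the distillation term push the question-answering branch toward predictions consistent with rationale-informed reasoning, which is one plausible reading of how a bridging branch could couple the two processes.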