A framework for Visual Commonsense Reasoning (VCR) must choose an answer and further provide a rationale justifying it, based on a given image and question, where the image contains all the facts needed for reasoning and must be sufficiently understood. Previous methods apply a detector to the image to obtain a set of visual objects without considering their exact positions in the scene, which is inadequate for properly understanding the spatial and semantic relationships between objects. In addition, VCR samples are quite diverse, so the parameters of the framework tend to be trained suboptimally on mini-batches. To address these challenges, a pseudo 3D perception Transformer with multi-level confidence optimization, named PPTMCO, is proposed for VCR in this paper. Specifically, image depth is introduced to represent the pseudo 3-dimensional (3D) positions of objects along with their 2-dimensional (2D) coordinates in the image, and to further enhance visual features. Then, considering that relationships between objects are influenced by depth, a depth-aware Transformer is proposed that performs attention guided by depth differences, from answer words and objects to objects, where each word is tagged with a pseudo depth value according to its related objects. To better optimize the parameters of the framework, a model parameter estimation method is further proposed that weightedly integrates the parameters optimized on mini-batches according to multi-level reasoning confidence. Experiments on the benchmark VCR dataset demonstrate that the proposed framework performs better than state-of-the-art approaches.
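The two central ideas, attention biased by pseudo-depth differences and confidence-weighted integration of mini-batch parameters, can be illustrated by a minimal NumPy sketch. This is not the paper's implementation; the function names, the additive depth-difference bias, and the `alpha` hyperparameter are assumptions made purely for illustration:

```python
import numpy as np

def depth_biased_attention(Q, K, V, depth_q, depth_k, alpha=1.0):
    """Toy sketch: scaled dot-product attention whose scores are
    biased by pseudo-depth differences between queries and keys.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d)
    depth_q, depth_k: pseudo depth value per query / key token.
    alpha: assumed bias-strength hyperparameter (not from the paper).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # standard attention scores
    depth_diff = np.abs(depth_q[:, None] - depth_k[None, :])
    scores = scores - alpha * depth_diff                # penalize pairs far apart in depth
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

def integrate_params(param_snapshots, confidences):
    """Toy sketch: weighted average of parameter snapshots, each
    obtained from one mini-batch, weighted by a reasoning-confidence
    score (normalized to sum to 1)."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, param_snapshots))
```

With a large `alpha`, a query attends almost exclusively to the key whose pseudo depth matches its own, which is the intuition behind letting depth differences guide the attention mechanism.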