Transformers for visual-language representation learning have attracted considerable interest and demonstrated strong performance on visual question answering (VQA) and grounding. However, most systems that perform well on these tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping each visual token in the visual encoder and uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA as well as the VQA-HAT dataset for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, integrating capsules significantly improves the grounding ability of such systems and yields new state-of-the-art results compared to other approaches in the field.
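The abstract describes grouping visual tokens into capsules and masking them with a text-guided selection signal derived from language self-attention activations. The following minimal PyTorch sketch illustrates one way such a masking step could look; the module name, dimensions, pooling, and soft-masking scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TextGuidedCapsuleMasking(nn.Module):
    """Illustrative sketch (not the paper's implementation): group visual
    tokens into capsules and down-weight them using a selection signal
    derived from language self-attention activations."""

    def __init__(self, dim=768, num_capsules=8):
        super().__init__()
        self.num_capsules = num_capsules
        # Project each visual token into `num_capsules` capsule groups (assumed grouping).
        self.capsule_proj = nn.Linear(dim, num_capsules * dim)
        # Map pooled language activations to per-capsule selection scores.
        self.selector = nn.Linear(dim, num_capsules)

    def forward(self, visual_tokens, lang_attn_activations):
        # visual_tokens: (B, N_v, dim); lang_attn_activations: (B, N_t, dim)
        B, N_v, dim = visual_tokens.shape
        # Group each visual token into capsules: (B, N_v, num_capsules, dim)
        capsules = self.capsule_proj(visual_tokens).view(B, N_v, self.num_capsules, dim)
        # Pool the language activations into a single text query vector (assumption).
        text_query = lang_attn_activations.mean(dim=1)        # (B, dim)
        # Per-capsule selection scores, shared across visual positions.
        scores = torch.sigmoid(self.selector(text_query))      # (B, num_capsules)
        # Mask (down-weight) capsules with low text-guided scores before the next layer.
        masked = capsules * scores[:, None, :, None]
        # Collapse capsule groups back to token features for the next encoder layer.
        return masked.sum(dim=2)

# Toy usage with random tensors (shapes are assumptions).
vis = torch.randn(2, 36, 768)    # e.g. 36 visual tokens
lang = torch.randn(2, 20, 768)   # e.g. 20 text tokens
out = TextGuidedCapsuleMasking()(vis, lang)
print(out.shape)                 # torch.Size([2, 36, 768])
```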