视力问题解答相对空间原因 (Weakly Supervised Relative Spatial Reasoning for Visual Question Answering) - 专知论文

会员服务 ·

0

视觉问答 · 自动问答 · 可理解性 · 监督 · state-of-the-art ·

2021 年 9 月 4 日

Weakly Supervised Relative Spatial Reasoning for Visual Question Answering

翻译：视力问题解答相对空间原因

Pratyay Banerjee,Tejas Gokhale,Yezhou Yang,Chitta Baral

from arxiv, Accepted to ICCV 2021. PaperId : ICCV2021-10857 Copyright transferred to IEEE ICCV. DOI will be updated later

Vision-and-language (V\&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding relative locations of objects, i.e.\ implicitly learning the geometry of the scene. In this work, we evaluate the faithfulness of V\&L models to such geometric understanding, by formulating the prediction of pair-wise relative locations of objects as a classification as well as a regression task. Our findings suggest that state-of-the-art transformer-based V\&L models lack sufficient abilities to excel at this task. Motivated by this, we design two objectives as proxies for 3D spatial reasoning (SR) -- object centroid estimation, and relative position estimation, and train V\&L with weak supervision from off-the-shelf depth estimators. This leads to considerable improvements in accuracy for the "GQA" visual question answering challenge (in fully supervised, few-shot, and O.O.D settings) as well as improvements in relative spatial reasoning. Code and data will be released \href{https://github.com/pratyay-banerjee/weak_sup_vqa}{here}.

翻译：视觉推理的一个重要方面是空间理解,它涉及了解物体的相对位置,即隐含地学习场景的几何特征。在这项工作中,我们通过预测对象的对称相对位置作为分类和回归任务来评估V ⁇ L模型对于这种几何理解的忠诚度。我们的研究结果表明,基于V ⁇ L的状态变压器模型缺乏足够的能力来完成这项任务。我们为此而设计的两个目标是3D空间推理(SR)的替代值 -- -- 对象偏差估计和相对位置估计,并将V ⁇ L的可靠性与从外层深度估测器进行微弱的监督。这导致“GQA”的视觉问题回答挑战的准确性有了相当大的提高(在充分监督、少发和O.O.D设置方面)以及相对空间推理的改进。代码和数据将发布。

0

相关内容

视觉问答

视觉问答（Visual Question Answering，VQA），是一种涉及计算机视觉和自然语言处理的学习任务。这一任务的定义如下： A VQA system takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output[1]。翻译为中文：一个VQA系统以一张图片和一个关于这张图片形式自由、开放式的自然语言问题作为输入，以生成一条自然语言答案作为输出。简单来说，VQA就是给定的图片进行问答。

知识荟萃

精品入门和进阶教程、论文和代码整理等

更多

查看相关VIP内容、论文、资讯等

【USC2021】常识推理，47页ppt，Commonsense Reasoning in the Wild

专知会员服务

33+阅读 · 2021年10月9日

【ICCV2021】域分离的全时段自监督单目深度估计

专知会员服务

7+阅读 · 2021年9月22日

“CVPR 2021 接受论文列表 1663篇论文都在这了

专知会员服务

32+阅读 · 2021年6月12日

【CVPR2020】自监督的深度视觉测程与在线适应，Self-Supervised Deep Visual Odometry

【CVPR2020】自监督的深度视觉测程与在线适应，Self-Supervised Deep Visual Odometry

专知会员服务

32+阅读 · 2020年5月14日

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

专知会员服务

12+阅读 · 2020年3月13日

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

专知会员服务

13+阅读 · 2020年3月12日

CVPR 2020 论文开源项目合集

专知会员服务

110+阅读 · 2020年3月12日

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

专知会员服务

10+阅读 · 2019年10月30日

【深度学习视频分析/多模态学习资源大列表】

【深度学习视频分析/多模态学习资源大列表】

专知会员服务

92+阅读 · 2019年10月16日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

CVPR2020接收论文开源代码

CVPR2020接收论文开源代码

专知

30+阅读 · 2020年2月29日

已删除

将门创投

5+阅读 · 2019年8月19日

【泡泡一分钟】OFF:快速鲁棒视频动作识别的运动表征

【泡泡一分钟】OFF:快速鲁棒视频动作识别的运动表征

泡泡机器人SLAM

3+阅读 · 2019年3月12日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

【推荐】视频目标分割基础

【推荐】视频目标分割基础

机器学习研究会

9+阅读 · 2017年9月19日

CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense

Arxiv

0+阅读 · 2021年10月27日

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

Arxiv

0+阅读 · 2021年10月25日

Few-shot Semantic Segmentation with Self-supervision from Pseudo-classes

Arxiv

0+阅读 · 2021年10月22日

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Arxiv

13+阅读 · 2020年12月14日

How Useful is Self-Supervised Pretraining for Visual Tasks?

How Useful is Self-Supervised Pretraining for Visual Tasks?

Arxiv

9+阅读 · 2020年3月31日

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

Arxiv

3+阅读 · 2019年5月10日

Relation-aware Graph Attention Network for Visual Question Answering

Arxiv

4+阅读 · 2019年3月29日

Joint Image Captioning and Question Answering

Arxiv

6+阅读 · 2018年5月22日

Learning to Count Objects in Natural Images for Visual Question Answering

Arxiv

12+阅读 · 2018年2月15日

Natural Language Guided Visual Relationship Detection

Arxiv

3+阅读 · 2017年11月21日

VIP会员

文章信息

相关主题

state-of-the-art

相关VIP内容

【USC2021】常识推理，47页ppt，Commonsense Reasoning in the Wild

专知会员服务

33+阅读 · 2021年10月9日

【ICCV2021】域分离的全时段自监督单目深度估计

专知会员服务

7+阅读 · 2021年9月22日

“CVPR 2021 接受论文列表 1663篇论文都在这了

专知会员服务

32+阅读 · 2021年6月12日

【CVPR2020】自监督的深度视觉测程与在线适应，Self-Supervised Deep Visual Odometry

【CVPR2020】自监督的深度视觉测程与在线适应，Self-Supervised Deep Visual Odometry

专知会员服务

32+阅读 · 2020年5月14日

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

【DeepMind-牛津-CMU-CVPR2020】无监督词映射视觉基准，Visual Grounding in Video

专知会员服务

12+阅读 · 2020年3月13日

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

【DeepMind-牛津-CMU-CVPR2020】无监督文字翻译视频中的视觉基础，Visual Grounding in Video for Unsupervised Word Translation

专知会员服务

13+阅读 · 2020年3月12日

CVPR 2020 论文开源项目合集

专知会员服务

110+阅读 · 2020年3月12日

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

专知会员服务

10+阅读 · 2019年10月30日

【深度学习视频分析/多模态学习资源大列表】

【深度学习视频分析/多模态学习资源大列表】

专知会员服务

92+阅读 · 2019年10月16日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《步兵小单元山地严寒作战指南》美军最新条令200页

《联合作战概念的发展》最新报告

俄制无人机弹药

《复杂场景下自主着陆的模型预测控制技术》92页

相关资讯

CVPR2020接收论文开源代码

CVPR2020接收论文开源代码

专知

30+阅读 · 2020年2月29日

已删除

将门创投

5+阅读 · 2019年8月19日

【泡泡一分钟】OFF:快速鲁棒视频动作识别的运动表征

【泡泡一分钟】OFF:快速鲁棒视频动作识别的运动表征

泡泡机器人SLAM

3+阅读 · 2019年3月12日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

gan生成图像at 1024² 的代码论文

gan生成图像at 1024² 的代码论文

CreateAMind

4+阅读 · 2017年10月31日

【推荐】视频目标分割基础

【推荐】视频目标分割基础

机器学习研究会

9+阅读 · 2017年9月19日

相关论文

CRIC: A VQA Dataset for Compositional Reasoning on Vision and Commonsense

Arxiv

0+阅读 · 2021年10月27日

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

Arxiv

0+阅读 · 2021年10月25日

Few-shot Semantic Segmentation with Self-supervision from Pseudo-classes

Arxiv

0+阅读 · 2021年10月22日

Temporal Relational Modeling with Self-Supervision for Action Segmentation

Arxiv

13+阅读 · 2020年12月14日

How Useful is Self-Supervised Pretraining for Visual Tasks?

How Useful is Self-Supervised Pretraining for Visual Tasks?

Arxiv

9+阅读 · 2020年3月31日

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

Arxiv

3+阅读 · 2019年5月10日

Relation-aware Graph Attention Network for Visual Question Answering

Arxiv

4+阅读 · 2019年3月29日

Joint Image Captioning and Question Answering

Arxiv

6+阅读 · 2018年5月22日

Learning to Count Objects in Natural Images for Visual Question Answering

Arxiv

12+阅读 · 2018年2月15日

Natural Language Guided Visual Relationship Detection

Arxiv

3+阅读 · 2017年11月21日

微信扫码咨询专知VIP会员