For robots to understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that comprehend referential language to identify common objects in real-world 3D scenes. In this paper, we introduce a spatial-language model for the 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of point clouds with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model successfully identifies the target object from the set of potential candidates. Our model, LanguageRefer, uses a transformer-based architecture that combines spatial embeddings from the bounding boxes with fine-tuned language embeddings from DistilBERT to predict the target object. We show that it performs competitively on the visio-linguistic datasets proposed by ReferIt3D. Further, we analyze its spatial reasoning performance decoupled from perception noise, its accuracy on view-dependent utterances, and viewpoint annotations for potential robotics applications.
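To make the architecture in the abstract concrete, the following is a minimal sketch of the idea of fusing spatial and language embeddings in one transformer-style sequence. All names, dimensions, and the random placeholder embeddings are illustrative assumptions, not the authors' implementation: real language embeddings would come from a fine-tuned DistilBERT, and the attention layer would have learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # embedding width (illustrative choice, not from the paper)

def spatial_embed(boxes, W):
    """Project 3D bounding boxes (center xyz + size whd) into the model width."""
    return boxes @ W  # shape: (num_objects, D)

def self_attention(x):
    """Single-head scaled dot-product self-attention (unlearned, for illustration)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

# Toy inputs: 5 candidate objects, 7 utterance tokens. The random vectors
# stand in for DistilBERT token embeddings, which we do not load here.
boxes = rng.normal(size=(5, 6))           # (cx, cy, cz, w, h, d) per candidate
lang_tokens = rng.normal(size=(7, D))     # placeholder language embeddings
W_spatial = rng.normal(size=(6, D)) * 0.1 # hypothetical learned projection

obj_tokens = spatial_embed(boxes, W_spatial)        # spatial embeddings
seq = np.concatenate([lang_tokens, obj_tokens], 0)  # joint token sequence
out = self_attention(seq)                           # one transformer-style layer

# Classification head: score each object token; argmax selects the
# predicted referent among the candidates.
w_cls = rng.normal(size=(D,)) * 0.1
logits = out[len(lang_tokens):] @ w_cls
pred = int(np.argmax(logits))
print(pred)
```

With random weights the prediction is of course meaningless; the point is only the data flow: boxes and utterance tokens share one sequence, attention mixes them, and a per-object head scores the candidates.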