Audio Visual Language Maps for Robot Navigation - Zhuanzhi Papers


Robots · Robot navigation · Modality · Multimodal · Cross-modal

March 27, 2023

Audio Visual Language Maps for Robot Navigation


Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard

From arXiv. Project page: https://avlmaps.github.io/

While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues. AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid. In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks. In particular, the addition of audio information enables robots to more reliably disambiguate goal locations. Extensive experiments in simulation show that AVLMaps enable zero-shot multimodal goal navigation from multimodal prompts and provide 50% better recall in ambiguous scenarios. These capabilities extend to mobile robots in the real world - navigating to landmarks referring to visual, audio, and spatial concepts. Videos and code are available at: https://avlmaps.github.io.
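The idea of fusing features into a voxel grid and then indexing goals by query can be illustrated with a toy sketch. This is not the authors' implementation: the function names `build_voxel_feature_map` and `query_map`, the averaging rule, and the cosine-similarity lookup are all assumptions made for illustration, and the feature vectors are hand-written placeholders standing in for embeddings that a pre-trained foundation model would produce.

```python
import numpy as np


def build_voxel_feature_map(points, features, voxel_size, grid_shape):
    """Average per-point feature vectors into a dense 3D voxel grid.

    Hypothetical helper: points is (N, 3) in metres, features is (N, D).
    Returns the (X, Y, Z, D) feature grid and an (X, Y, Z) occupancy mask.
    """
    dim = features.shape[1]
    grid = np.zeros((*grid_shape, dim), dtype=np.float64)
    counts = np.zeros(grid_shape, dtype=np.int64)

    # Map each 3D point to the integer index of the voxel that contains it.
    idx = np.floor(points / voxel_size).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, feats = idx[in_bounds], features[in_bounds]

    # Accumulate features and hit counts; np.add.at handles repeated indices.
    cell = (idx[:, 0], idx[:, 1], idx[:, 2])
    np.add.at(grid, cell, feats)
    np.add.at(counts, cell, 1)

    # Turn sums into means for every voxel that received at least one point.
    occupied = counts > 0
    grid[occupied] /= counts[occupied][:, None]
    return grid, occupied


def query_map(grid, occupied, query):
    """Return the occupied voxel whose feature best matches the query.

    Matching uses cosine similarity, an assumption chosen for this sketch.
    """
    feats = grid[occupied]          # (M, D), same order as np.argwhere
    coords = np.argwhere(occupied)  # (M, 3)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    q = query / (np.linalg.norm(query) + 1e-12)
    scores = feats @ q
    best = int(np.argmax(scores))
    return tuple(int(c) for c in coords[best]), float(scores[best])


# Toy data: two points near the origin share one feature, a third point
# farther along x carries a different one (placeholder embeddings).
points = np.array([[0.10, 0.1, 0.1], [0.15, 0.1, 0.1], [1.20, 0.3, 0.1]])
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
grid, occupied = build_voxel_feature_map(points, features, 0.5, (4, 4, 1))
voxel, score = query_map(grid, occupied, np.array([0.0, 1.0]))
print(voxel, round(score, 3))
```

In this sketch a text, image, or audio query would each be embedded into the same feature space before calling `query_map`; how the real system embeds and combines modalities is described in the paper and on the project page, not here.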

