展望和语言导航多式变革器 (History Aware Multimodal Transformer for Vision-and-Language Navigation) - 专知论文

会员服务 ·

0

多峰值 · 变换 · Next · Vision · 值域 ·

2021 年 10 月 25 日

History Aware Multimodal Transformer for Vision-and-Language Navigation

翻译：展望和语言导航多式变革器

Shizhe Chen,Pierre-Louis Guhur,Cordelia Schmid,Ivan Laptev

from arxiv, Accepted in NeurIPS 2021; project page at https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation and finally takes into account temporal relation between panoramas in the history. It, then, jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks including single step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories.

翻译：视觉和语言导航(VLN) 旨在建立自主视觉感官,在真实的场景中遵循指示和导航; 记住以前访问过的地点和采取的行动, VLN的多数方法使用经常性状态执行记忆; 相反, 我们引入了历史认知多式变换器(HAMT), 将长方位历史纳入多式联运决策。 HAMT 有效地将过去的所有全景观测编码为等级视觉变压器(VIT), 它首先将个人图像与VT编码, 然后在全景观察中模拟图像之间的空间关系, 并最终考虑到历史全景之间的时间关系。然后, 它将文本、历史和当前观察结合起来, 以预测下一步行动。我们首先用一些代理任务来培训HAMT终端到终端, 包括单步行动预测和空间关系预测, 然后利用强化学习来进一步改进导航政策。 HAMT在甚广范围的VLN任务上实现了新的艺术状态, 包括带有精细指示的VLN(R, RxR), 高级指示(R-R-R) 以及长期导航(R-R-LI) 显示我们-R- RM) 以长期任务。

0

相关内容

多峰值

NeurIPS 2021教程|OpenAI-Lilian Weng等：自监督学习与对比学习，105页ppt，

NeurIPS 2021教程|OpenAI-Lilian Weng等：自监督学习与对比学习，105页ppt，

专知会员服务

78+阅读 · 2021年12月10日

【NUS-Xavier教授】注意力神经网络，79页ppt

【NUS-Xavier教授】注意力神经网络，79页ppt

专知会员服务

65+阅读 · 2021年11月25日

【ICML2021】生成式视频转换器Transformers: 物体可以是文字吗?

专知会员服务

13+阅读 · 2021年8月20日

【ICCV 2021 】Vision Transformer中的相对位置编码

专知会员服务

30+阅读 · 2021年7月30日

【复旦大学邱锡鹏教授】自然语言处理中的自注意力模型，53页ppt

【复旦大学邱锡鹏教授】自然语言处理中的自注意力模型，53页ppt

专知会员服务

129+阅读 · 2020年9月2日

【CVPR2020】视觉导航的神经拓扑SLAM，Neural Topological SLAM for Visual Navigation

【CVPR2020】视觉导航的神经拓扑SLAM，Neural Topological SLAM for Visual Navigation

专知会员服务

52+阅读 · 2020年5月26日

【ACL2019】基于学习注意力机制的知识图谱中关系预测的嵌入 Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

【ACL2019】基于学习注意力机制的知识图谱中关系预测的嵌入 Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

专知会员服务

122+阅读 · 2020年3月29日

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

专知会员服务

31+阅读 · 2020年3月11日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

AAAI2020接受论文揭晓，1591篇上榜，接收率20.6%！最新论文抢鲜看

AAAI2020接受论文揭晓，1591篇上榜，接收率20.6%！最新论文抢鲜看

专知

42+阅读 · 2019年11月11日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

ICLR2019最佳论文出炉

ICLR2019最佳论文出炉

专知

12+阅读 · 2019年5月6日

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

PaperWeekly

7+阅读 · 2019年5月5日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【论文推荐】最新四篇CVPR2018 视频描述生成相关论文—双向注意力、Transformer、重构网络、层次强化学习

【论文推荐】最新四篇CVPR2018 视频描述生成相关论文—双向注意力、Transformer、重构网络、层次强化学习

专知

31+阅读 · 2018年6月4日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

ELSA: Enhanced Local Self-Attention for Vision Transformer

Arxiv

0+阅读 · 2021年12月23日

Curriculum Learning for Vision-and-Language Navigation

Arxiv

8+阅读 · 2021年11月14日

Transformer in Transformer

Arxiv

11+阅读 · 2021年10月26日

Visual-and-Language Navigation: A Survey and Taxonomy

Arxiv

8+阅读 · 2021年8月26日

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Arxiv

3+阅读 · 2021年3月22日

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Arxiv

7+阅读 · 2020年6月11日

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Arxiv

9+阅读 · 2018年11月25日

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

Arxiv

5+阅读 · 2018年4月5日

Investigations on Knowledge Base Embedding for Relation Prediction and Extraction

Arxiv

8+阅读 · 2018年2月6日

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

Arxiv

3+阅读 · 2017年11月24日

VIP会员

文章信息

相关主题

相关VIP内容

NeurIPS 2021教程|OpenAI-Lilian Weng等：自监督学习与对比学习，105页ppt，

NeurIPS 2021教程|OpenAI-Lilian Weng等：自监督学习与对比学习，105页ppt，

专知会员服务

78+阅读 · 2021年12月10日

【NUS-Xavier教授】注意力神经网络，79页ppt

【NUS-Xavier教授】注意力神经网络，79页ppt

专知会员服务

65+阅读 · 2021年11月25日

【ICML2021】生成式视频转换器Transformers: 物体可以是文字吗?

专知会员服务

13+阅读 · 2021年8月20日

【ICCV 2021 】Vision Transformer中的相对位置编码

专知会员服务

30+阅读 · 2021年7月30日

【复旦大学邱锡鹏教授】自然语言处理中的自注意力模型，53页ppt

【复旦大学邱锡鹏教授】自然语言处理中的自注意力模型，53页ppt

专知会员服务

129+阅读 · 2020年9月2日

【CVPR2020】视觉导航的神经拓扑SLAM，Neural Topological SLAM for Visual Navigation

【CVPR2020】视觉导航的神经拓扑SLAM，Neural Topological SLAM for Visual Navigation

专知会员服务

52+阅读 · 2020年5月26日

【ACL2019】基于学习注意力机制的知识图谱中关系预测的嵌入 Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

【ACL2019】基于学习注意力机制的知识图谱中关系预测的嵌入 Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs

专知会员服务

122+阅读 · 2020年3月29日

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

【华盛顿大学】用于视觉和语言导航的多视图学习，Multi-View Learning for Vision-and-Language Navigation

专知会员服务

31+阅读 · 2020年3月11日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

【伯克利博士论文】通过真实世界实践赋能机器人自主性

军用无人机集群技术尚未成熟——但潜力可期

人工智能安全治理白皮书（2025）

AgentOps综述：分类、挑战与未来方向

相关资讯

AAAI2020接受论文揭晓，1591篇上榜，接收率20.6%！最新论文抢鲜看

AAAI2020接受论文揭晓，1591篇上榜，接收率20.6%！最新论文抢鲜看

专知

42+阅读 · 2019年11月11日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

ICLR2019最佳论文出炉

ICLR2019最佳论文出炉

专知

12+阅读 · 2019年5月6日

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

无人机视觉挑战赛 | ICCV 2019 Workshop—VisDrone2019

PaperWeekly

7+阅读 · 2019年5月5日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【论文推荐】最新四篇CVPR2018 视频描述生成相关论文—双向注意力、Transformer、重构网络、层次强化学习

【论文推荐】最新四篇CVPR2018 视频描述生成相关论文—双向注意力、Transformer、重构网络、层次强化学习

专知

31+阅读 · 2018年6月4日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

计算机视觉近一年进展综述

计算机视觉近一年进展综述

机器学习研究会

9+阅读 · 2017年11月25日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

ELSA: Enhanced Local Self-Attention for Vision Transformer

Arxiv

0+阅读 · 2021年12月23日

Curriculum Learning for Vision-and-Language Navigation

Arxiv

8+阅读 · 2021年11月14日

Transformer in Transformer

Arxiv

11+阅读 · 2021年10月26日

Visual-and-Language Navigation: A Survey and Taxonomy

Arxiv

8+阅读 · 2021年8月26日

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Arxiv

3+阅读 · 2021年3月22日

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Arxiv

7+阅读 · 2020年6月11日

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Arxiv

9+阅读 · 2018年11月25日

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

Arxiv

5+阅读 · 2018年4月5日

Investigations on Knowledge Base Embedding for Relation Prediction and Extraction

Arxiv

8+阅读 · 2018年2月6日

Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

Arxiv

3+阅读 · 2017年11月24日

微信扫码咨询专知VIP会员