As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time, through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5%, respectively. Ablations show that these tasks benefit from audio pretraining, even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
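To make the masked-snippet objective described above concrete, the sketch below shows one way such a contrastive selection loss can be written: the joint encoder's hidden state at each MASK position must pick out the representation of the snippet that was removed, with the other snippets in the batch serving as negatives. This is a minimal illustration under assumptions of our own (the function name, the use of in-batch negatives, and the temperature value are hypothetical), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_span_objective(joint_reprs, target_reprs, temperature=0.05):
    """Hypothetical sketch of a masked-snippet contrastive loss.

    joint_reprs:  (B, D) hidden states at MASKed positions from a joint encoder.
    target_reprs: (B, D) independently encoded representations of the true
                  masked-out snippets (e.g. text or audio); the remaining rows
                  in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    joint_reprs = F.normalize(joint_reprs, dim=-1)
    target_reprs = F.normalize(target_reprs, dim=-1)

    # Similarity of every MASK position against every candidate snippet.
    logits = joint_reprs @ target_reprs.t() / temperature  # (B, B)

    # Each MASK position should select its own masked-out snippet (the diagonal).
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this reading, "choosing the correct masked-out snippet" reduces to a cross-entropy over similarity scores, which is why the objective scales naturally with batch size and pretraining data.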