As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations of videos through all constituent modalities. When finetuned, it sets a new state of the art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing the ethical and societal implications of multimodal pretraining.
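To make the described objective concrete, below is a minimal, self-contained sketch of a contrastive masked-snippet loss in the spirit of the abstract: the model's output at a MASK position is scored against a pool of candidate snippet embeddings (the true masked-out snippet plus distractors), and the loss encourages picking the correct one. All function names, shapes, and the temperature value are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a contrastive masked-snippet objective (not the paper's code).
import numpy as np

def masked_snippet_loss(mask_pred, candidate_embs, target_idx, temperature=0.05):
    """Score a MASK-position prediction against candidate snippet embeddings.

    mask_pred:       (d,)   model output at the MASK position
    candidate_embs:  (k, d) embeddings of the true snippet plus k-1 distractors
    target_idx:      index of the true (masked-out) snippet among the candidates
    """
    # L2-normalize so the scores are cosine similarities.
    pred = mask_pred / np.linalg.norm(mask_pred)
    cands = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)

    logits = cands @ pred / temperature              # (k,) similarity scores
    logits -= logits.max()                           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]                    # cross-entropy: pick the true snippet

# Toy usage: 1 "correct" snippet plus 7 distractors in a 16-dim embedding space.
rng = np.random.default_rng(0)
pred = rng.normal(size=16)
cands = rng.normal(size=(8, 16))
cands[3] = pred + 0.1 * rng.normal(size=16)          # make candidate 3 the true snippet
print(masked_snippet_loss(pred, cands, target_idx=3))
```

Framing the objective as selection among candidates (rather than regenerating the masked audio or text) is what lets the same setup cover both modalities; the sketch above only illustrates that selection step under the stated assumptions.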