MERLOT: Multimodal Neural Script Knowledge Models - Zhuanzhi Papers


June 4, 2021

MERLOT: Multimodal Neural Script Knowledge Models


Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

From arXiv. Project page: https://rowanzellers.com/merlot

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes). Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
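The abstract says the model learns "to match images to temporally corresponding words" but does not spell out the loss. One common way to express such a matching objective is a symmetric contrastive loss over frame and text embeddings. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def frame_text_matching_loss(frame_vecs, text_vecs, temperature=0.05):
    """Symmetric contrastive loss: frame i should match text segment i.

    frame_vecs, text_vecs: equal-length lists of embedding vectors, where
    index i in both lists comes from the same video segment.
    The temperature default is an illustrative assumption.
    """
    frames = [normalize(v) for v in frame_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(frames)
    # sims[i][j]: temperature-scaled cosine similarity of frame i and text j.
    sims = [[dot(f, t) / temperature for t in texts] for f in frames]
    # Frame -> text direction: each row is a classification over texts.
    f2t = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    # Text -> frame direction: each column is a classification over frames.
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    t2f = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (f2t + t2f)


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the text order swapped.
aligned = frame_text_matching_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = frame_text_matching_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

In this sketch the loss is low when each frame is most similar to its own transcript segment and high when the pairing is scrambled, which is the behaviour a frame-level matching objective is meant to reward. The video-level (temporal) objectives mentioned in the abstract are not covered here.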


