Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation, we introduce a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code and datasets will be made publicly available at https://antoyang.github.io/just-ask.html.
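To make the training objective above concrete, the following is a minimal PyTorch sketch of an in-batch contrastive loss between the output of a video-question multi-modal encoder and the output of an answer encoder. The L2 normalization, the `temperature` value, and the use of in-batch negatives are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_embeddings, answer_embeddings, temperature=0.07):
    """In-batch contrastive loss: each video-question embedding is pulled
    toward its paired answer embedding and pushed away from the other
    answers in the batch.

    vq_embeddings:     (B, D) outputs of a video-question multi-modal encoder
    answer_embeddings: (B, D) outputs of an answer encoder
    """
    vq = F.normalize(vq_embeddings, dim=-1)
    ans = F.normalize(answer_embeddings, dim=-1)
    logits = vq @ ans.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(vq.size(0), device=vq.device)     # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for the two encoders.
if __name__ == "__main__":
    vq = torch.randn(8, 256)
    ans = torch.randn(8, 256)
    print(contrastive_vqa_loss(vq, ans).item())
```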