Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive, and hinders scalability. In this work, we propose to avoid manual annotation and to generate a large-scale training dataset for video question answering, making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting, and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on the MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate the \webdataname{} dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce \smalldatasetname{}, a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available at https://antoyang.github.io/just-ask.html
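As a rough illustration of the contrastive training procedure mentioned above, the following is a minimal sketch (not the authors' released code): a video-question multi-modal encoder and an answer encoder produce embeddings, and an InfoNCE-style cross-entropy over in-batch similarities pulls matching video-question-answer triplets together while pushing apart the other answers in the batch. The encoder outputs are stand-in random tensors, and the exact loss and negative-sampling details in the paper may differ.

\begin{verbatim}
import torch
import torch.nn.functional as F

def contrastive_loss(vq_emb: torch.Tensor,
                     ans_emb: torch.Tensor) -> torch.Tensor:
    """vq_emb: (B, D) embeddings from a video-question transformer.
    ans_emb: (B, D) embeddings from an answer transformer.
    Row i of each tensor forms a positive pair; all other rows
    in the batch serve as negatives."""
    # Similarity matrix between every video-question pair and every answer.
    scores = vq_emb @ ans_emb.t()                       # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    # Softmax cross-entropy over in-batch candidates (InfoNCE-style).
    return F.cross_entropy(scores, targets)

# Toy usage with random features standing in for the two transformers.
vq = F.normalize(torch.randn(8, 256), dim=-1)
ans = F.normalize(torch.randn(8, 256), dim=-1)
loss = contrastive_loss(vq, ans)
\end{verbatim}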