Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation, we introduce a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code and datasets will be made publicly available at https://antoyang.github.io/just-ask.html.
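To make the training objective above concrete, the following is a minimal PyTorch sketch of an in-batch contrastive loss between the output of a video-question multi-modal encoder and the output of an answer encoder. The L2 normalization, the `temperature` value, and the use of in-batch negatives are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_embeddings, answer_embeddings, temperature=0.07):
    """In-batch contrastive loss: each video-question embedding is pulled
    toward its paired answer embedding and pushed away from the other
    answers in the batch.

    vq_embeddings:     (B, D) outputs of a video-question multi-modal encoder
    answer_embeddings: (B, D) outputs of an answer encoder
    """
    vq = F.normalize(vq_embeddings, dim=-1)
    ans = F.normalize(answer_embeddings, dim=-1)
    logits = vq @ ans.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(vq.size(0), device=vq.device)     # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for the two encoders.
if __name__ == "__main__":
    vq = torch.randn(8, 256)
    ans = torch.randn(8, 256)
    print(contrastive_vqa_loss(vq, ans).item())
```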