Modern approaches to visual question answering require large annotated datasets for training. Manual annotation of questions and answers for videos, however, is tedious and expensive, and prevents scalability. In this work, we propose to avoid manual annotation and to learn video question answering (VideoQA) from millions of readily-available narrated videos. We propose to automatically generate question-answer pairs from transcribed video narrations leveraging a state-of-the-art text transformer pipeline and obtain a new large-scale VideoQA training dataset. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer embedding. We evaluate our model on the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that finetuning our model on target datasets significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA and ActivityNet-QA. Finally, for a detailed evaluation we introduce a new manually annotated VideoQA dataset with reduced language biases and high-quality annotations. Our code and datasets will be made publicly available at https://www.di.ens.fr/willow/research/just-ask/ .
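A minimal sketch of the kind of contrastive objective described above, not the authors' released implementation: video-question embeddings produced by a multi-modal transformer are scored against answer embeddings, and the matching pair is encouraged over the other answers in the batch. Function names, dimensions and the use of in-batch negatives are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_videoqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor) -> torch.Tensor:
    # vq_emb: (B, D) video-question embeddings; ans_emb: (B, D) answer embeddings.
    # The i-th answer is assumed to be the ground truth for the i-th video-question pair.
    scores = vq_emb @ ans_emb.t()                          # (B, B) similarity matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy over in-batch answers pushes each matching pair's score
    # above the scores of all other (negative) answers in the batch.
    return F.cross_entropy(scores, targets)

# Example usage with random features standing in for transformer outputs.
if __name__ == "__main__":
    B, D = 8, 512
    vq = F.normalize(torch.randn(B, D), dim=-1)
    ans = F.normalize(torch.randn(B, D), dim=-1)
    print(contrastive_videoqa_loss(vq, ans).item())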