Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and hinders scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, here we build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
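To make point (iii) concrete, below is a minimal, text-only sketch of answering a question via masked language modeling, where a [MASK] token stands in for the answer and candidate answers are ranked by their score at the mask position. This is not the FrozenBiLM pipeline itself (it omits the visual inputs and the light trainable modules, and uses a generic `bert-base-uncased` checkpoint rather than the frozen BiLM from the paper); the `score_answers` helper and the prompt template are illustrative assumptions.

```python
# Illustrative sketch only: rank candidate answers by the masked-LM score of the mask slot.
# Assumes the Hugging Face transformers library and a generic masked LM checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()  # the language model stays frozen; no parameters are updated here

def score_answers(question: str, candidate_answers: list[str]) -> dict[str, float]:
    """Score single-token candidate answers by the masked-LM logit at the [MASK] position."""
    # Hypothetical prompt template; FrozenBiLM additionally prepends visual tokens and speech.
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos].squeeze(0)  # vocabulary scores at the mask
    scores = {}
    for answer in candidate_answers:
        answer_ids = tokenizer(answer, add_special_tokens=False).input_ids
        if len(answer_ids) == 1:  # keep the sketch to single-token answers for simplicity
            scores[answer] = logits[answer_ids[0]].item()
    return scores

print(score_answers("What animal is chasing the ball?", ["dog", "cat", "car"]))
```

In this framing, zero-shot inference reduces to picking the answer candidate whose token receives the highest score at the mask, which is why a bidirectional masked LM is a natural fit for the task described in the abstract.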