Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA, outperforming the previous state-of-the-art by large margins. At the same time, our model reduces the required GFLOPs from 150-360 to only 67, yielding a highly efficient video question answering model.