Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs of using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks, based on existing fine-grained action recognition datasets, that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
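To make the inference-time frame ensemble idea concrete, below is a minimal sketch of one plausible variant: uniformly sample several frames, score each frame independently with the single-frame model, and mean-pool the scores. The names `model`, `video_frames`, and the mean-pooling choice are illustrative assumptions, not the paper's exact implementation (the released code also explores other aggregation strategies).

```python
import torch


def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread uniformly across the video."""
    step = num_video_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


@torch.no_grad()
def ensemble_score(model, video_frames, text, num_samples: int = 4) -> torch.Tensor:
    """Score a (video, text) pair with a single-frame model by averaging
    per-frame matching scores over uniformly sampled frames.

    Assumes `model(frame, text)` returns a scalar similarity score; the
    model itself never sees more than one frame at a time.
    """
    indices = uniform_frame_indices(len(video_frames), num_samples)
    scores = torch.stack([model(video_frames[i], text) for i in indices])
    return scores.mean()  # mean-pooling; max-pooling is another option
```

In this sketch, temporal information is never modeled: the ensemble only aggregates independent per-frame predictions, which is what makes the strong results of single-frame training evidence of a static appearance bias in the benchmarks.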