Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets of a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future study of advanced VidL models. VALUE is available at https://value-benchmark.github.io/.