Recent video+language datasets cover domains where the interaction is either highly structured, as in instructional videos, or scripted, as in TV shows. Both properties can introduce spurious cues that models exploit instead of learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or 'soccer') highlight videos with transcribed live commentaries in English. Because the course of a game is unpredictable, so are its commentaries, which makes them a unique resource for investigating dynamic language grounding. We also provide state-of-the-art (SOTA) baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval, and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines.