Despite the recent emergence of video captioning models, generating vivid, fine-grained video descriptions grounded in background knowledge (i.e., long and informative commentary about domain-specific scenes with appropriate reasoning) remains far from solved, even though it has valuable applications such as automatic sports narration. In this paper, we present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, and use it to propose a challenging new task setting, Knowledge-grounded Video Captioning (KGVC). Moreover, we experimentally adapt existing methods to demonstrate the difficulty of this valuable and applicable task and to identify potential directions for solving it.