Research on text generation from multimodal inputs has largely focused on static images rather than video data. In this paper, we propose a new task, narration generation, which complements videos with narration texts to be interjected at several points. The narrations are part of the video and contribute to the storyline unfolding in it. Moreover, they are context-informed, since they include information appropriate for the timeframe of the video they cover; they also need not include every detail shown in the input scenes, as a caption would. We collect a new dataset from the animated television series Peppa Pig. Furthermore, we formalize narration generation as comprising two separate sub-tasks, timing and content generation, and present a set of models for the new task.