Dense video captioning aims to identify the events of interest in an input video and generate descriptive captions for each event. Previous approaches usually follow a two-stage generative process: first propose a segment for each event, then render a caption for each identified segment. Recent advances in large-scale sequence-generation pretraining have been highly successful at unifying task formulations across a wide variety of tasks, but so far, more complex tasks such as dense video captioning have been unable to fully exploit this powerful paradigm. In this work, we show how to model the two subtasks of dense video captioning jointly as a single sequence-generation task, simultaneously predicting the events and their corresponding descriptions. Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of integrating complex tasks such as end-to-end dense video captioning into large-scale pretrained models.
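To make the joint formulation concrete, below is a minimal sketch of one plausible way to linearize event segments and captions into a single target sequence for a sequence-generation model. It assumes timestamps are quantized into special time tokens; the token format, bin count, and function names are illustrative assumptions, not the exact scheme of this work.

# Sketch (assumptions, not the paper's exact scheme): serialize
# dense-video-captioning targets -- event segments plus captions -- into
# one flat token sequence, so a single model predicts both subtasks.

def quantize(t: float, duration: float, n_bins: int = 100) -> str:
    """Map a timestamp to one of n_bins special time tokens, e.g. <12>."""
    bin_id = min(int(t / duration * n_bins), n_bins - 1)
    return f"<{bin_id}>"

def serialize_events(events, duration, n_bins=100):
    """events: list of (start_sec, end_sec, caption) tuples.
    Returns one target string interleaving segment tokens and captions."""
    parts = []
    for start, end, caption in sorted(events):
        parts.append(f"{quantize(start, duration, n_bins)} "
                     f"{quantize(end, duration, n_bins)} {caption}")
    return " ".join(parts)

# Example: a 300-second cooking video with two annotated events.
events = [(12.0, 45.0, "crack two eggs into a bowl"),
          (50.0, 90.0, "whisk the eggs with a fork")]
print(serialize_events(events, duration=300.0))
# -> <4> <15> crack two eggs into a bowl <16> <30> whisk the eggs with a fork

A decoder trained on such targets emits segment boundaries and caption words in one pass, which is what lets both subtasks share a single large-scale pretrained sequence model.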