The best summary of a long video differs from person to person due to the highly subjective nature of the task; even for the same person, the best summary may change with time or mood. In this paper, we introduce the task of generating customized video summaries through simple text. First, we train a deep architecture to effectively learn semantic embeddings of video frames by leveraging the abundance of image-caption data in a progressive and residual manner. Given a user-specific text description, our algorithm selects semantically relevant video segments and produces a temporally aligned video summary. To evaluate our textually customized video summaries, we compare against baseline methods that utilize ground-truth information. Despite these challenging baselines, our method still achieves comparable or even superior performance. We also show that our method can generate semantically diverse video summaries using only the learned visual embeddings.
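The abstract does not specify how the query-driven selection step works; purely as illustration, below is a minimal sketch of one plausible reading of it, assuming per-frame semantic embeddings and a text-query embedding have already been produced by the learned encoders (the names `frame_emb`, `query_emb`, the fixed segment length, and top-k selection are all hypothetical, not the paper's actual method).

```python
import numpy as np

def summarize(frame_emb, query_emb, seg_len=5, k=3):
    """Pick the k video segments whose frames are, on average, most similar
    to the text-query embedding, then return them in temporal order.

    frame_emb: (T, D) array of per-frame semantic embeddings (assumed given).
    query_emb: (D,) embedding of the user's text description (assumed given).
    """
    # L2-normalize so dot products become cosine similarities.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)

    # Score each fixed-length segment by its mean frame-query similarity.
    n_seg = len(f) // seg_len
    scores = np.array([
        f[i * seg_len:(i + 1) * seg_len] @ q for i in range(n_seg)
    ]).mean(axis=1)

    # Keep the top-k segments, re-sorted by start time so the output
    # summary is temporally aligned with the source video.
    top = np.sort(np.argsort(scores)[-k:])
    return [(i * seg_len, (i + 1) * seg_len) for i in top]

# Toy usage with random stand-ins for the learned embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 512))   # stand-in for frame embeddings
query = rng.normal(size=512)           # stand-in for the encoded text query
print(summarize(frames, query))        # e.g. [(start, end), ...] in order
```

Sorting the selected segment indices before emitting them is what keeps the summary temporally aligned, matching the abstract's claim that relevant segments are assembled in time order rather than by relevance rank.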