Visual storytelling is the task of creating a short story from a photo stream. Unlike conventional visual captioning, storytelling aims to produce not only factual descriptions but also human-like narration and semantics. However, the VIST dataset provides only a small, fixed number of photos per story, so the main challenge of visual storytelling is to fill the visual gaps between photos with a narrative and imaginative story. In this paper, we propose to explicitly learn to imagine a storyline that bridges these visual gaps. During training, one or more photos are randomly omitted from the input stack, and the network is trained to produce a complete, plausible story even with the missing photo(s). Furthermore, we propose a hide-and-tell model for visual storytelling, designed to learn non-local relations across the photo stream and to refine and improve conventional RNN-based models. In experiments, we show that our hide-and-tell scheme and network design are indeed effective for storytelling, and that our model outperforms previous state-of-the-art methods on automatic metrics. Finally, we qualitatively demonstrate the learned ability to interpolate a storyline over visual gaps.
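The random-omission step of the training scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `hide_photos`, the `max_hide` parameter, and the use of a placeholder token for hidden positions are assumptions for the sake of the example.

```python
import random

def hide_photos(photo_feats, max_hide=1, hide_token=None):
    """Hypothetical sketch of the hide-and-tell input preparation:
    randomly omit one or more photos from the input stack, so the model
    must still generate a full, plausible story for the whole stream.

    photo_feats: list of per-photo feature vectors (one entry per photo).
    max_hide:    upper bound on how many photos may be hidden.
    hide_token:  placeholder substituted at each hidden position.
    """
    n = len(photo_feats)
    # Hide at least one photo, but always keep at least one visible.
    k = random.randint(1, min(max_hide, n - 1))
    hidden = set(random.sample(range(n), k))
    masked = [hide_token if i in hidden else f
              for i, f in enumerate(photo_feats)]
    return masked, hidden
```

At training time, `masked` would be fed to the story generator while the loss is still computed against the full ground-truth story, encouraging the model to imagine the storyline across the visual gap.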