Visual Storytelling (VIST) is the task of telling a narrative story about a certain topic from a given photo stream. Existing studies focus on designing complex models that rely on a huge amount of human-annotated data. However, annotation for VIST is extremely costly, and many topics cannot be covered in the training dataset due to the long-tail topic distribution. In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. Inspired by the way humans tell stories, we propose a topic-adaptive storyteller to model inter-topic generalization. In practice, we apply a gradient-based meta-learning algorithm to multi-modal seq2seq models to endow the model with the ability to adapt quickly from topic to topic. In addition, we propose a prototype encoding structure to model intra-topic derivation. Specifically, we encode and store the few training story texts to serve as a reference that guides generation at inference time. Experimental results show that topic adaptation and prototype encoding mutually benefit the few-shot model on the BLEU and METEOR metrics. A further case study shows that the stories generated after few-shot adaptation are more relevant and expressive.
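To make the topic-adaptation step concrete, below is a minimal first-order MAML-style sketch in PyTorch. It assumes the storyteller is a multi-modal seq2seq model whose forward pass returns a token-level loss, and that each sampled topic provides a few support stories for inner-loop adaptation and query stories for the outer update. The function and argument names (`topic_adapt_meta_step`, `topic_batches`, `inner_lr`, `meta_lr`) are illustrative, not taken from the paper.

```python
import copy
import torch
from torch import optim


def topic_adapt_meta_step(model, topic_batches, inner_lr=1e-3, meta_lr=1e-4, inner_steps=1):
    """One meta-update over a set of sampled topics (first-order MAML sketch).

    `model` is assumed to return a scalar seq2seq loss when called on a batch;
    `topic_batches` yields (support_batch, query_batch) dict pairs, one per topic.
    """
    meta_optimizer = optim.Adam(model.parameters(), lr=meta_lr)
    meta_optimizer.zero_grad()

    for support_batch, query_batch in topic_batches:
        # Inner loop: adapt a copy of the model to the topic's few support stories.
        adapted = copy.deepcopy(model)
        inner_opt = optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            support_loss = adapted(**support_batch)  # loss on support stories
            support_loss.backward()
            inner_opt.step()

        # Outer loop: evaluate the adapted copy on the topic's query stories and
        # accumulate first-order gradients back into the original parameters.
        query_loss = adapted(**query_batch)
        grads = torch.autograd.grad(query_loss, adapted.parameters())
        for param, grad in zip(model.parameters(), grads):
            param.grad = grad if param.grad is None else param.grad + grad

    meta_optimizer.step()
```

The first-order approximation (adapting a deep copy and ignoring second-order terms) is a common simplification that keeps memory and compute low; at test time, the same inner loop would be run on the few support stories of an unseen topic before generating.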