Generating a short story from an image is a challenging task. Unlike image captioning, story generation from an image poses multiple challenges: preserving story coherence, appropriately assessing the quality of the story, steering the generated story toward a certain style, and addressing the scarcity of image-story pair datasets, which limits supervision during training. In this work, we introduce Plug-and-Play Story Teller (PPST) and improve image-to-story generation by: 1) alleviating the data scarcity problem by incorporating large pre-trained models, namely CLIP and GPT-2, to facilitate fluent image-to-text generation with minimal supervision, and 2) enabling more style-relevant generation by incorporating stylistic adapters to control the story generation. We conduct image-to-story generation experiments with non-styled, romance-styled, and action-styled PPST approaches and compare the generated stories with those of previous work along three aspects, i.e., story coherence, image-story relevance, and style fitness, using both automatic and human evaluation. The results show that PPST improves story coherence and achieves better image-story relevance, but its stylistic fit remains insufficient.
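To make the plug-and-play idea concrete, the sketch below illustrates one way such a pipeline can be wired together: a frozen CLIP encoder produces an image embedding, a small trained mapping projects it into GPT-2's embedding space as a prefix, and GPT-2 decodes a story conditioned on that prefix. This is a minimal illustration under assumed design choices, not the authors' released implementation; the `mapper` projection, the prefix length, and the greedy decoding loop are hypothetical stand-ins, and a stylistic adapter would additionally be inserted into GPT-2's blocks to steer the style.

```python
# Minimal sketch of a CLIP-prefix, plug-and-play image-to-story pipeline
# (assumed design; mapper and prefix_len are hypothetical components).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

# Hypothetical lightweight mapper: the only trained component, projecting the
# frozen CLIP image embedding into a prefix of GPT-2 token embeddings.
prefix_len = 10
mapper = nn.Linear(clip.config.projection_dim, prefix_len * gpt2.config.n_embd)

@torch.no_grad()
def generate_story(image: Image.Image, max_new_tokens: int = 100) -> str:
    pixel = clip_proc(images=image, return_tensors="pt").pixel_values
    img_emb = clip.get_image_features(pixel_values=pixel)   # (1, d_clip)
    prefix = mapper(img_emb).view(1, prefix_len, gpt2.config.n_embd)

    # Greedy decoding conditioned on the image prefix; a stylistic adapter
    # inside GPT-2's layers would shift the distribution toward a target style.
    generated, embeds = [], prefix
    for _ in range(max_new_tokens):
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)                      # (1,)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        next_emb = gpt2.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_emb], dim=1)
    return tok.decode(generated, skip_special_tokens=True)
```

Because CLIP and GPT-2 both stay frozen in this sketch, only the small mapper (and any adapters) would need training, which is what allows fluent generation with minimal image-story supervision.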