Visual storytelling is a creative and challenging task that aims to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they rely on word-level sequence generation and do not adequately model sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework that models sentence-level and word-level semantics separately. We use the Transformer-based BERT model to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM takes the BERT sentence vectors as input and learns the dependencies between the sentences corresponding to the images, while the top LSTM, conditioned on the bottom LSTM's output, generates the corresponding word representations. Experimental results demonstrate that our model outperforms the most closely related baselines on the automatic evaluation metrics BLEU and CIDEr, and human evaluation further confirms the effectiveness of our method.
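To make the two-level architecture concrete, here is a minimal PyTorch sketch of one plausible wiring of the hierarchy. The dimensions, module names, and teacher-forced decoding loop are illustrative assumptions rather than the authors' exact implementation; the only structure taken from the abstract is that a bottom LSTM consumes one BERT sentence vector per image and a top LSTM decodes each sentence word by word from the bottom LSTM's state.

```python
import torch
import torch.nn as nn

class HierarchicalStoryteller(nn.Module):
    """Sketch of a two-level LSTM: sentence-level context + word-level decoding.

    All hyperparameters here (768-d BERT vectors, 512-d hidden states,
    vocabulary size, etc.) are assumed for illustration only.
    """

    def __init__(self, sent_dim=768, hidden_dim=512, vocab_size=10000, word_dim=300):
        super().__init__()
        # Bottom LSTM: one BERT sentence embedding per image, modelling
        # dependencies across the sentences of the story.
        self.sentence_lstm = nn.LSTM(sent_dim, hidden_dim, batch_first=True)
        # Top LSTM: decodes each sentence word by word, conditioned on the
        # bottom LSTM's state for that sentence.
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.word_lstm = nn.LSTM(word_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sent_vecs, word_ids):
        # sent_vecs: (batch, n_sents, sent_dim)  BERT sentence embeddings
        # word_ids:  (batch, n_sents, seq_len)   gold words for teacher forcing
        _, n_sents, seq_len = word_ids.shape
        sent_states, _ = self.sentence_lstm(sent_vecs)      # (batch, n_sents, hidden_dim)
        logits = []
        for i in range(n_sents):
            # Broadcast the i-th sentence-level state to every decoding step.
            ctx = sent_states[:, i:i + 1, :].expand(-1, seq_len, -1)
            emb = self.word_embed(word_ids[:, i, :])        # (batch, seq_len, word_dim)
            h, _ = self.word_lstm(torch.cat([emb, ctx], dim=-1))
            logits.append(self.out(h))                      # (batch, seq_len, vocab_size)
        return torch.stack(logits, dim=1)                   # (batch, n_sents, seq_len, vocab_size)

# Shape check with random inputs (5 images -> 5 sentences of up to 20 tokens):
model = HierarchicalStoryteller()
story_logits = model(torch.randn(2, 5, 768), torch.randint(0, 10000, (2, 5, 20)))
print(story_logits.shape)  # torch.Size([2, 5, 20, 10000])
```

Feeding the sentence-level state into every word-level decoding step is what lets the word decoder stay consistent with the story context established by the preceding sentences, which is the coherence property the framework targets.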