Visual storytelling is the task of generating a multi-sentence story from a set of images. Appropriately capturing the visual variation and contextual information contained in the input images is one of the most challenging aspects of the task; as a result, stories generated from image sets often lack cohesiveness, relevance, and semantic coherence. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, each input image is divided into 16×16 patches, which are flattened and passed through a linear projection. This transformation from a single image into multiple image patches captures the visual variety of the input. The resulting features are fed to a Bidirectional LSTM, which serves as the sequence encoder and captures both past and future context across all image patches. An attention mechanism is then applied to increase the discriminative power of the representation passed to the language model, a Mogrifier-LSTM. The performance of the proposed model is evaluated on the Visual Story-Telling dataset (VIST), and the results show that our model outperforms current state-of-the-art models.
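To make the described pipeline concrete, the following is a minimal PyTorch sketch of the flow the abstract outlines: a ViT-style patch embedding, a Bi-LSTM sequence encoder, an additive attention layer, and a simplified Mogrifier-LSTM decoder. This is not the authors' implementation; all dimensions, the attention form, and the number of mogrification rounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split each image into 16x16 patches and linearly project the flattened patches."""
    def __init__(self, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)

class AdditiveAttention(nn.Module):
    """Score encoder states and return a weighted context vector (Bahdanau-style; assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, enc):                     # enc: (B, T, dim)
        w = torch.softmax(self.score(enc), dim=1)   # attention weights over T positions
        return (w * enc).sum(dim=1)             # (B, dim)

class MogrifierLSTMCell(nn.Module):
    """Simplified Mogrifier-LSTM cell: the input and hidden state gate each other
    for a few rounds before the ordinary LSTM update."""
    def __init__(self, dim, rounds=4):
        super().__init__()
        self.rounds = rounds
        self.q = nn.ModuleList([nn.Linear(dim, dim) for _ in range(rounds)])
        self.cell = nn.LSTMCell(dim, dim)

    def forward(self, x, state):
        h, c = state
        for i in range(self.rounds):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(self.q[i](h)) * x
            else:
                h = 2 * torch.sigmoid(self.q[i](x)) * h
        return self.cell(x, (h, c))

# Wiring the pieces together for one story of 5 images (shape check only).
B, n_img, dim = 2, 5, 768
images = torch.randn(B * n_img, 3, 224, 224)
patches = PatchEmbedding(dim=dim)(images)               # (B*5, 196, 768)
encoder = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
enc, _ = encoder(patches)                               # (B*5, 196, 768), past+future context
context = AdditiveAttention(dim)(enc)                   # (B*5, 768), one vector per image
context = context.view(B, n_img, dim)

decoder = MogrifierLSTMCell(dim)
h = c = torch.zeros(B, dim)
for t in range(n_img):                                  # one image context per decoding step
    h, c = decoder(context[:, t], (h, c))               # h would feed a vocabulary softmax
```

In a full model the decoder step would also consume word embeddings and project `h` onto the vocabulary to generate the story sentence by sentence; the sketch only traces the tensor shapes through the encoder-attention-decoder path.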