Video captioning is a popular task that challenges models to describe events in videos using natural language. In this work, we investigate the ability of various visual feature representations derived from state-of-the-art convolutional neural networks to capture high-level semantic context. We introduce the Weighted Additive Fusion Transformer with Memory Augmented Encoders (WAFTM), a captioning model that incorporates memory into its transformer encoder and fuses features with a novel weighting method that ensures more significant representations receive due importance. We further demonstrate the performance gains realized by applying Word-Piece tokenization and the popular REINFORCE algorithm. Finally, we benchmark our model on two datasets, obtaining a CIDEr score of 92.4 on MSVD and a METEOR score of 0.091 on the ActivityNet Captions dataset.
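To make the fusion idea concrete, the sketch below shows one plausible reading of weighted additive fusion: each feature stream is linearly projected to a common dimension and scaled by a learned, softmax-normalized weight before the streams are summed. The class name `WeightedAdditiveFusion`, the feature dimensions, and the per-stream projection scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class WeightedAdditiveFusion(nn.Module):
    """Hypothetical sketch of weighted additive feature fusion.

    Each input stream (e.g., appearance and motion CNN features) is
    projected to a shared dimension; a learned scalar per stream,
    normalized with softmax, scales its contribution to the sum.
    """

    def __init__(self, input_dims, d_model):
        super().__init__()
        # One linear projection per feature stream.
        self.projections = nn.ModuleList(
            nn.Linear(d, d_model) for d in input_dims
        )
        # One learnable weight logit per stream.
        self.logits = nn.Parameter(torch.zeros(len(input_dims)))

    def forward(self, features):
        # features: list of tensors, each (batch, seq_len, input_dims[i]).
        weights = torch.softmax(self.logits, dim=0)
        projected = [proj(f) for proj, f in zip(self.projections, features)]
        # Weighted additive fusion: sum of weighted projected streams.
        return sum(w * p for w, p in zip(weights, projected))

# Usage with assumed appearance (2048-d) and motion (1024-d) features.
fusion = WeightedAdditiveFusion(input_dims=[2048, 1024], d_model=512)
appearance = torch.randn(4, 20, 2048)
motion = torch.randn(4, 20, 1024)
print(fusion([appearance, motion]).shape)  # torch.Size([4, 20, 512])
```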