In this paper, we focus on effectively applying the transformer architecture to video captioning. The vanilla transformer was proposed for unimodal language generation tasks such as machine translation. Video captioning, however, is a multimodal learning problem, and video features contain substantial redundancy across time steps. Motivated by these concerns, we propose a novel method called the sparse boundary-aware transformer (SBAT) to reduce the redundancy in the video representation. SBAT applies a boundary-aware pooling operation to the multi-head attention scores and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information lost by the sparse operation. On top of SBAT, we further propose an aligned cross-modal encoding scheme to strengthen the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most metrics.
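The abstract does not specify how boundary-aware selection is computed; as a rough intuition only, the idea of reducing temporal redundancy by keeping one representative frame feature per scene segment can be sketched as follows. The boundary criterion (largest consecutive-feature differences) and the representative choice (frame nearest the segment mean) are illustrative assumptions, not the paper's actual SBAT pooling over attention scores.

```python
import numpy as np

def boundary_aware_select(features: np.ndarray, num_segments: int) -> np.ndarray:
    """Illustrative sketch (not the paper's method): split T frame
    features at the largest consecutive differences, treated here as
    proxy scene boundaries, then keep one representative frame per
    segment to reduce temporal redundancy."""
    T = features.shape[0]
    # Boundary score: L2 distance between consecutive frame features.
    diffs = np.linalg.norm(features[1:] - features[:-1], axis=1)
    # Take the (num_segments - 1) largest jumps as segment boundaries.
    cuts = np.sort(np.argsort(diffs)[-(num_segments - 1):] + 1)
    segments = np.split(np.arange(T), cuts)
    # Representative per segment: the frame closest to the segment mean.
    keep = []
    for seg in segments:
        center = features[seg].mean(axis=0)
        keep.append(seg[np.argmin(np.linalg.norm(features[seg] - center, axis=1))])
    return np.array(keep)
```

For a clip whose features change abruptly once, this selects one frame from each side of the change, so downstream attention operates on a shorter, less redundant sequence.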