When describing spatio-temporal events in natural language, video captioning models mostly rely on the encoder's latent visual representation. Recent encoder-decoder models attend to encoder features mainly through linear interaction with the decoder. However, the growing complexity of models for visual data calls for more explicit feature interaction to capture fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to unveil richer visual representations, either by concatenation or by using a linear layer. Because the feature sets for a video semantically overlap to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions is a fundamental component of expressing one event from several meaningful perspectives, and it is currently missing in the temporal, i.e., video captioning, domain. To this end, we propose the Variational Stacked Local Attention Network (VSLAN), which exploits low-rank bilinear pooling for self-attentive feature interaction and stacks multiple video feature streams in a discount fashion. Each feature stack's learned attributes contribute to our proposed diversity encoding module, followed by the decoding query stage, to facilitate end-to-end generation of diverse and natural captions without any explicit supervision on attributes. We evaluate VSLAN on the MSVD and MSR-VTT datasets in terms of syntax and diversity. The CIDEr score of VSLAN outperforms current off-the-shelf methods by $7.8\%$ on MSVD and $4.5\%$ on MSR-VTT. On the same datasets, VSLAN achieves competitive results in caption diversity metrics.
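To make the phrase "low-rank bilinear pooling for self-attentive feature interaction" more concrete, the following is a minimal sketch of the general technique in plain Python. It is not the VSLAN implementation: the function names, the `tanh` activation, the mean-pooled query, the dimensions, and the random weights are all assumptions chosen for illustration. The idea is that two vectors are projected into a shared rank-`r` space, combined by an element-wise product, and projected to a score, which avoids the full bilinear weight tensor.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def low_rank_bilinear(q, k, U, V, P):
    """Low-rank bilinear pooling: P (tanh(U q) * tanh(V k)).

    U is (r x dq), V is (r x dk), P is (out x r). The rank r bounds the
    cost, replacing a full (out x dq x dk) bilinear weight tensor.
    The tanh activation is an illustrative assumption.
    """
    hq = [math.tanh(a) for a in matvec(U, q)]
    hk = [math.tanh(b) for b in matvec(V, k)]
    joint = [a * b for a, b in zip(hq, hk)]  # element-wise (Hadamard) product
    return matvec(P, joint)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames, U, V, p):
    """Score every frame against the query, then pool frames by attention."""
    scores = [low_rank_bilinear(query, f, U, V, [p])[0] for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return weights, context


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r, n_frames = 8, 4, 5                 # feature dim, rank, number of frames
frames = random_matrix(n_frames, d, rng)  # stand-in per-frame features
# Self-attention: the query is derived from the same features (mean-pooled).
query = [sum(f[i] for f in frames) / n_frames for i in range(d)]
U = random_matrix(r, d, rng)
V = random_matrix(r, d, rng)
p = [rng.uniform(-0.5, 0.5) for _ in range(r)]

weights, context = attend(query, frames, U, V, p)
print("attention weights:", [round(w, 3) for w in weights])
```

In a trained model `U`, `V`, and `p` would be learned parameters, and the frame features would come from a visual encoder rather than a random generator.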