We study the impact of visual assistance on automated audio captioning. Using multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating a variety of pretrained visual features. We perform experiments on a YouTube-based audiovisual data set and investigate the effect of this transfer learning technique on a variety of captioning metrics. We find that only one of the considered types of pretrained features provides consistent improvements, while the others yield no noteworthy gains. Interestingly, prior research indicates that the exact opposite holds for sound event detection, leading us to conclude that the optimal choice of visual embeddings is strongly dependent on the task at hand. More specifically, visual features that focus on semantics appear appropriate for automated audio captioning, whereas for sound event detection, temporal information seems to be more important.
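To make the setup concrete, the following is a minimal PyTorch-style sketch of a multi-encoder transformer captioner that fuses an audio stream with pretrained visual embeddings. All module names, dimensions, and the fusion-by-concatenation scheme are illustrative assumptions, not the exact architecture used in this work.

```python
# Minimal sketch (not the authors' exact model): a multi-encoder transformer
# that encodes audio features and pretrained visual embeddings separately,
# then decodes a caption while cross-attending to both encoder outputs.
import torch
import torch.nn as nn


class MultiEncoderCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2,
                 audio_feat_dim=64, visual_feat_dim=512):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.audio_proj = nn.Linear(audio_feat_dim, d_model)
        self.visual_proj = nn.Linear(visual_feat_dim, d_model)

        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)

        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, audio_feats, visual_feats, caption_tokens):
        # Encode each modality with its own encoder.
        audio_mem = self.audio_encoder(self.audio_proj(audio_feats))
        visual_mem = self.visual_encoder(self.visual_proj(visual_feats))
        # Simple fusion: concatenate both memories along the time axis so the
        # decoder can cross-attend to audio and visual tokens jointly.
        memory = torch.cat([audio_mem, visual_mem], dim=1)

        tgt = self.token_emb(caption_tokens)
        seq_len = caption_tokens.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out_proj(hidden)  # per-token vocabulary logits


if __name__ == "__main__":
    model = MultiEncoderCaptioner(vocab_size=5000)
    audio = torch.randn(2, 500, 64)            # (batch, audio frames, mel bins)
    visual = torch.randn(2, 30, 512)           # (batch, video frames, embedding dim)
    tokens = torch.randint(0, 5000, (2, 20))   # teacher-forced caption tokens
    logits = model(audio, visual, tokens)
    print(logits.shape)                        # torch.Size([2, 20, 5000])
```

Swapping the visual branch's input between different pretrained embeddings (semantic versus temporally oriented features) is the kind of comparison the abstract refers to; the fusion strategy itself is kept deliberately simple here.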