Impact of visual assistance for automated audio captioning

We study the impact of visual assistance for automated audio captioning. Utilizing multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating a variety of pretrained features. We perform experiments on a YouTube-based audiovisual data set and investigate the effect of applying the considered transfer learning technique in terms of a variety of captioning metrics. We find that only one of the considered kinds of pretrained features provides consistent improvements, while the others do not provide any noteworthy gains at all. Interestingly, the outcomes of prior research efforts indicate that the exact opposite is true in the case of sound event detection, leading us to conclude that the optimal choice of visual embeddings is strongly dependent on the task at hand. More specifically, visual features focusing on semantics appear appropriate in the context of automated audio captioning, while for sound event detection, time information seems to be more important.
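As a rough illustration of the multi-encoder idea mentioned above, the following minimal sketch lets a single decoder query attend separately to an audio encoder output and, optionally, to a visual encoder output. It is not the architecture used in the study: the function names (`attend`, `multi_encoder_context`), the single-head attention without learned projections, and the fusion by simple averaging are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over a list of memory vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, vec)) / scale for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    # Weighted sum of the memory vectors (the memory acts as both keys and values).
    return [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]


def multi_encoder_context(query, audio_memory, visual_memory=None):
    """Hypothetical fusion: average the audio context with the visual context.

    With visual_memory=None this reduces to an audio-only baseline.
    Averaging is an assumption; other fusion schemes are equally plausible.
    """
    audio_ctx = attend(query, audio_memory)
    if visual_memory is None:
        return audio_ctx
    visual_ctx = attend(query, visual_memory)
    return [(a + v) / 2 for a, v in zip(audio_ctx, visual_ctx)]


# Toy inputs: two audio frames and one pretrained visual embedding (made-up values).
query = [1.0, 0.0]
audio_memory = [[1.0, 0.0], [0.0, 1.0]]
visual_memory = [[2.0, 2.0]]

audio_only = multi_encoder_context(query, audio_memory)
audio_visual = multi_encoder_context(query, audio_memory, visual_memory)
```

Comparing `audio_only` with `audio_visual` mirrors, in miniature, the kind of ablation the abstract describes: the same decoder is evaluated with and without the extra visual encoder.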

