Audio captioning is the task of generating captions that describe the content of audio clips. In the real world, many objects produce similar sounds, and it is difficult to identify such acoustically ambiguous sound events from audio information alone. Accurately recognizing these ambiguous sounds is a major challenge for audio captioning systems. In this work, inspired by human audio-visual multi-modal perception, we propose visually-aware audio captioning, which makes use of visual information to help recognize ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to process the video input and incorporate the extracted visual features into an audio captioning system. Furthermore, to better exploit the complementary contexts in the redundant audio-visual streams, we propose an audio-visual attention mechanism that integrates audio and visual information adaptively according to their confidence levels. Experimental results on AudioCaps, the largest publicly available audio captioning dataset, show that the proposed method achieves significant improvements over a strong baseline audio captioning system and is on par with the state-of-the-art result.
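As a rough illustration of confidence-based audio-visual fusion, the sketch below combines projected audio and visual features using learned per-modality confidence weights. The module name, feature dimensions, and softmax gating scheme are assumptions made for illustration only and are not the exact formulation used in this work.

```python
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    """Hypothetical sketch of adaptive audio-visual fusion:
    each time step is weighted by a learned confidence score per modality."""

    def __init__(self, audio_dim: int, visual_dim: int, hidden_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Scalar confidence logit per modality and per time step (assumed design).
        self.audio_conf = nn.Linear(hidden_dim, 1)
        self.visual_conf = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, audio_dim), visual_feats: (B, T, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        # Softmax over the two modalities yields adaptive fusion weights.
        conf = torch.softmax(
            torch.cat([self.audio_conf(a), self.visual_conf(v)], dim=-1), dim=-1
        )  # (B, T, 2)
        fused = conf[..., 0:1] * a + conf[..., 1:2] * v  # (B, T, hidden_dim)
        return fused


if __name__ == "__main__":
    # Toy usage: 10 time steps of 512-d audio and 768-d visual features.
    fusion = AudioVisualAttention(audio_dim=512, visual_dim=768, hidden_dim=256)
    audio = torch.randn(2, 10, 512)
    visual = torch.randn(2, 10, 768)
    print(fusion(audio, visual).shape)  # torch.Size([2, 10, 256])
```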