Automated audio captioning (AAC) has developed rapidly in recent years, combining acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. Current models are generally based on the neural encoder-decoder architecture, and their decoders mainly rely on acoustic information extracted by a CNN-based encoder. However, they ignore semantic information that could help the AAC model generate more meaningful descriptions. This paper proposes a novel approach to automated audio captioning that incorporates both semantic and acoustic information. Specifically, our audio captioning model consists of two sub-modules. (1) The keyword encoder initializes its parameters from a pre-trained ResNet38 and is then trained with extracted keywords as labels. (2) The multi-modal attention decoder adopts an LSTM-based decoder that contains semantic and acoustic attention modules. Experiments demonstrate that our proposed model achieves state-of-the-art performance on the Clotho dataset. Our code can be found at https://github.com/WangHelin1997/DCASE2021_Task6_PKU
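To make the decoder design concrete, the following is a minimal PyTorch sketch of one decoding step that attends over both acoustic frame embeddings and semantic keyword embeddings, then fuses the two context vectors with the previous word embedding before the LSTM update. It is an illustrative assumption of how such a multi-modal attention decoder can be wired, not the authors' exact implementation; all module names and dimensions are hypothetical.

```python
# Sketch of a multi-modal (acoustic + semantic) attention step for an
# LSTM-based captioning decoder. Dimensions and names are illustrative.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, query_dim: int, key_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.key_proj = nn.Linear(key_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys):
        # query: (batch, query_dim); keys: (batch, seq, key_dim)
        energy = torch.tanh(self.query_proj(query).unsqueeze(1) + self.key_proj(keys))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)   # (batch, seq)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)        # (batch, key_dim)
        return context, weights


class MultiModalAttentionDecoderStep(nn.Module):
    def __init__(self, vocab_size, word_dim=256, acoustic_dim=2048,
                 semantic_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.acoustic_attn = AdditiveAttention(hidden_dim, acoustic_dim)
        self.semantic_attn = AdditiveAttention(hidden_dim, semantic_dim)
        self.lstm = nn.LSTMCell(word_dim + acoustic_dim + semantic_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, acoustic_feats, keyword_embs, state):
        # prev_word: (batch,) token ids; acoustic_feats: (batch, T, acoustic_dim)
        # keyword_embs: (batch, K, semantic_dim); state: (h, c) LSTM state
        h, c = state
        a_ctx, _ = self.acoustic_attn(h, acoustic_feats)  # attend over audio frames
        s_ctx, _ = self.semantic_attn(h, keyword_embs)    # attend over keyword embeddings
        lstm_in = torch.cat([self.embed(prev_word), a_ctx, s_ctx], dim=-1)
        h, c = self.lstm(lstm_in, (h, c))
        return self.out(h), (h, c)                         # vocabulary logits, new state
```

At inference time, such a step would be unrolled word by word (greedy or beam search), with the acoustic features coming from the CNN encoder and the keyword embeddings from the keyword encoder described above.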