Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain benefits a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. To this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process and conduct thorough empirical studies to examine the performance on the public AudioSet [1] with different types of inputs. Our main observations are that: 1) joint learning of audio and voice inputs improves AED performance (mean average precision) for both a CNN baseline (0.292 vs. 0.134 mAP) and a TALNet [2] baseline (0.361 vs. 0.351 mAP); 2) augmenting with the extra voice features is critical to maximizing model performance with dual inputs.
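To make the dual-branch idea concrete, the sketch below shows one plausible way such a model could be wired up in PyTorch: one branch encodes log-mel acoustic features with a small CNN, the other embeds a pre-extracted voice representation with an MLP, and the two are fused before multi-label classification over AudioSet's 527 classes. This is only an illustrative assumption of the architecture described in the abstract; all layer sizes, the voice-embedding dimension, and the fusion-by-concatenation choice are hypothetical, not the authors' exact design.

```python
# Minimal sketch of a dual-branch AED model (illustrative assumptions, not the
# paper's exact architecture). The acoustic branch consumes spectrograms; the
# voice branch consumes fixed-size voice embeddings from a speaker model.
import torch
import torch.nn as nn

class DualBranchAED(nn.Module):
    def __init__(self, n_mels=64, voice_dim=256, n_classes=527):
        super().__init__()
        # Acoustic branch: small CNN over (batch, 1, time, n_mels) spectrograms.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (batch, 64, 1, 1)
        )
        # Voice branch: MLP over a pre-extracted voice embedding.
        self.voice_branch = nn.Sequential(
            nn.Linear(voice_dim, 128), nn.ReLU(),
        )
        # Joint classifier over the concatenated branch outputs.
        self.classifier = nn.Linear(64 + 128, n_classes)

    def forward(self, spectrogram, voice_embedding):
        a = self.audio_branch(spectrogram).flatten(1)       # (batch, 64)
        v = self.voice_branch(voice_embedding)               # (batch, 128)
        logits = self.classifier(torch.cat([a, v], dim=1))   # (batch, n_classes)
        return torch.sigmoid(logits)  # per-class probabilities for multi-label AED

# Example: 8 clips of 1000 frames x 64 mel bins, each with a 256-d voice embedding.
model = DualBranchAED()
probs = model(torch.randn(8, 1, 1000, 64), torch.randn(8, 256))
print(probs.shape)  # torch.Size([8, 527])
```

Under this sketch, clip-level mAP on AudioSet would be computed from the sigmoid outputs against the multi-label ground truth, mirroring the evaluation metric reported above.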