Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied task of automatic speech recognition (ASR). The goal of AAC is to generate natural language descriptions of the contents of audio samples. We propose several approaches for end-to-end joint modeling of the ASR and AAC tasks and demonstrate their advantages over traditional approaches that model these tasks independently. A major hurdle in evaluating our proposed approaches is the lack of labeled audio datasets with both speech transcriptions and audio captions. We therefore create a multi-task dataset by mixing the clean-speech Wall Street Journal corpus with multiple levels of background noise drawn from the AudioCaps dataset. Extensive experimental evaluation shows that our proposed methods improve over existing state-of-the-art ASR and AAC methods.
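The abstract does not spell out the mixing recipe used to build the dataset; the following is a minimal sketch of SNR-controlled additive mixing, the standard way to combine a clean utterance with background noise at several levels. The function name mix_at_snr and the specific SNR values are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Mix a noise signal into clean speech at a target SNR (in dB).

        The noise is tiled or truncated to match the speech length, then
        scaled so that 10 * log10(P_speech / P_noise) equals snr_db.
        """
        # Match the noise length to the speech clip.
        if len(noise) < len(speech):
            reps = int(np.ceil(len(speech) / len(noise)))
            noise = np.tile(noise, reps)
        noise = noise[: len(speech)]

        # Average power of each signal.
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise

        # Scale the noise so the mixture hits the requested SNR.
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise

    # Example: mix a WSJ-style utterance with an AudioCaps-style background
    # clip at several SNR levels (the levels here are placeholders).
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)   # stand-in for a 1 s clean utterance
    noise = rng.standard_normal(8000)     # stand-in for a background clip
    for snr_db in (20.0, 10.0, 0.0):
        mixture = mix_at_snr(speech, noise, snr_db)

Scaling the noise rather than the speech leaves the clean utterance untouched, which is convenient when each utterance already has a fixed reference transcript.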