Deep neural network (DNN) based end-to-end optimization in the complex time-frequency (T-F) domain or time domain has shown considerable potential in monaural speech separation. Many recent studies optimize loss functions defined solely in the time or complex domain, without including a loss on magnitude. Although such loss functions typically produce better scores when evaluated with objective time-domain metrics, they produce worse scores on speech quality and intelligibility metrics and usually lead to worse speech recognition performance than loss functions that include a loss on magnitude. While this phenomenon has been observed experimentally in many studies, it is often not accurately explained, and a thorough understanding of its fundamental cause is lacking. This paper provides a novel view from the perspective of the implicit compensation between estimated magnitude and phase. Analytical results based on monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions support the validity of our view.
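One way to picture the compensation between magnitude and phase is a toy example with a single T-F unit. The sketch below is not taken from the paper: the function names, the squared complex-domain error, the absolute-error magnitude term, and its weight of 1.0 are all illustrative assumptions. It fixes a phase error and searches for the estimated magnitude that minimizes each loss. For the squared complex-domain error alone, the minimizer is the true magnitude scaled by the cosine of the phase error, so an imperfect phase pulls the estimated magnitude below the true one; adding a magnitude term pulls it back toward the true magnitude.

```python
import numpy as np


def complex_l2(true_mag, est_mag, phase_error):
    """Squared complex-domain error for one T-F unit (illustrative choice).

    The target phase is set to 0, so phase_error is the estimate's phase.
    """
    target = true_mag * np.exp(1j * 0.0)
    estimate = est_mag * np.exp(1j * phase_error)
    return np.abs(target - estimate) ** 2


def complex_l2_plus_magnitude(true_mag, est_mag, phase_error, weight=1.0):
    """Complex-domain error plus an absolute-error magnitude term.

    The weight of 1.0 is an assumption made for this sketch.
    """
    magnitude_term = np.abs(est_mag - true_mag)
    return complex_l2(true_mag, est_mag, phase_error) + weight * magnitude_term


def grid_minimizer(loss_fn, true_mag, phase_error):
    """Return the candidate magnitude with the lowest loss on a fine grid."""
    candidates = np.linspace(0.0, 2.0 * true_mag, 2001)
    losses = loss_fn(true_mag, candidates, phase_error)
    return float(candidates[np.argmin(losses)])


true_mag = 1.0
phase_error = np.pi / 3  # a fixed 60-degree phase error (hypothetical value)

best_complex = grid_minimizer(complex_l2, true_mag, phase_error)
best_combined = grid_minimizer(complex_l2_plus_magnitude, true_mag, phase_error)

print("complex-domain loss only:", best_complex)
print("closed form |S| cos(error):", true_mag * np.cos(phase_error))
print("with magnitude term:", best_combined)
```

In this toy setting the complex-domain loss alone prefers a shrunken magnitude, which is consistent with better time-domain scores but a less accurate magnitude estimate. Real systems optimize over whole utterances with a network, so this only conveys the intuition.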