Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function. The basic SDR is, however, undefined if the network reconstructs the reference signal perfectly or if the reference signal contains only silence, e.g., when a two-output separator processes a single-speaker recording. Many modifications of the plain SDR have been proposed that trade off robustness of the loss against distortion of its value. We propose to switch from a mean over the SDRs of the individual output channels to a global SDR computed over all output channels at the same time, which we call source-aggregated SDR (SA-SDR). This makes the loss robust against silence and perfect reconstruction as long as at least one reference signal is not silent. We show experimentally that our proposed SA-SDR is more stable than, and preferable to, other well-known modifications when processing meeting-style data, which typically contains many silent or single-speaker regions.
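The difference between the averaged SDR and the source-aggregated SDR can be illustrated with a minimal NumPy sketch. It assumes the common energy-ratio form of the SDR, 10 log10(||s||^2 / ||s - ŝ||^2), and reads "a global SDR over all output channels" as summing the signal and error energies over all channels before taking the ratio. The function names and the toy signals are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np


def sdr_db(reference, estimate):
    """Plain SDR in dB for one channel (assumed energy-ratio definition)."""
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # Silence the NumPy warnings for the degenerate cases discussed in the
    # abstract: a silent reference gives log10(0) = -inf, and a perfect
    # reconstruction divides by zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        return 10 * np.log10(signal_energy / error_energy)


def averaged_sdr_db(references, estimates):
    """Mean of the per-channel SDRs; arrays have shape (channels, samples)."""
    return np.mean([sdr_db(r, e) for r, e in zip(references, estimates)])


def sa_sdr_db(references, estimates):
    """Source-aggregated SDR: energies are summed over all channels first."""
    signal_energy = np.sum(references ** 2)
    error_energy = np.sum((references - estimates) ** 2)
    with np.errstate(divide="ignore", invalid="ignore"):
        return 10 * np.log10(signal_energy / error_energy)


# Toy two-output case with a single active speaker: channel 1 is silent.
references = np.array([[1.0, -1.0, 1.0, -1.0],
                       [0.0, 0.0, 0.0, 0.0]])
estimates = references + 0.1  # small constant error on every sample

# The silent channel has zero reference energy, so its SDR is -inf and
# drags the average to -inf; the aggregated ratio stays finite.
print("averaged SDR:", averaged_sdr_db(references, estimates))
print("SA-SDR:      ", sa_sdr_db(references, estimates))
```

In this sketch the aggregated ratio only degenerates when every reference is silent or every channel is reconstructed perfectly, which matches the robustness condition stated in the abstract.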