End-to-end learning models have demonstrated a remarkable capability in performing speech segregation. Despite their wide scope of real-world applications, little is known about the mechanisms they employ to group and consequently segregate individual speakers. Since harmonicity is known to be a critical cue for these networks to group sources, in this work we thoroughly investigate ConvTasnet and DPT-Net to analyze how they perform a harmonic analysis of the input mixture. We perform ablation studies in which we apply low-pass, high-pass, and band-stop filters with varying pass-bands to determine empirically which harmonics are most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities in synthetic mixtures. We find that end-to-end networks are highly unstable and perform poorly when confronted with deformations that are imperceptible to humans. Replacing the encoder in these networks with a spectrogram leads to lower overall performance but much higher stability. This work helps us understand what information these networks rely on for speech segregation, and exposes two sources of generalization errors. It also pinpoints the encoder as the part of the network responsible for these errors, allowing for a redesign with expert knowledge or transfer learning.
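The filtering ablation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the filter design, so the ideal FFT-mask filters below are an assumption, as are the helper names `fft_band_filter`, `harmonic_tone`, and `band_energy`, the 8 kHz sample rate, and the synthetic harmonic tone used in place of real speech.

```python
import numpy as np


def fft_band_filter(signal, sample_rate, kind, low_hz=None, high_hz=None):
    """Apply an ideal (brick-wall) filter by zeroing FFT bins.

    Assumption: the filter design used in the paper is not stated;
    this mask-based filter is only an illustration.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    if kind == "lowpass":
        keep = freqs <= high_hz
    elif kind == "highpass":
        keep = freqs >= low_hz
    elif kind == "bandstop":
        keep = (freqs < low_hz) | (freqs > high_hz)
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    # Passing n keeps the output the same length as the input.
    return np.fft.irfft(spectrum * keep, n=len(signal))


def harmonic_tone(f0, n_harmonics, sample_rate, duration):
    """Synthetic stand-in for voiced speech: a sum of harmonics of f0."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))


def band_energy(signal, sample_rate, low_hz, high_hz):
    """Spectral energy of the signal between low_hz and high_hz (inclusive)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return power[mask].sum()


if __name__ == "__main__":
    sr = 8000
    tone = harmonic_tone(200.0, 5, sr, 1.0)  # harmonics at 200, 400, ..., 1000 Hz
    ablations = {
        "lowpass": fft_band_filter(tone, sr, "lowpass", high_hz=500.0),
        "highpass": fft_band_filter(tone, sr, "highpass", low_hz=500.0),
        "bandstop": fft_band_filter(tone, sr, "bandstop", low_hz=350.0, high_hz=450.0),
    }
    for name, filtered in ablations.items():
        remaining = band_energy(filtered, sr, 0, sr / 2) / band_energy(tone, sr, 0, sr / 2)
        print(f"{name}: fraction of energy kept = {remaining:.2f}")
```

In an ablation of this kind, each filtered mixture would be passed to the separation network and the resulting separation quality compared against the unfiltered baseline, which indicates the harmonics the network depends on most.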