Time-domain training criteria have proven very effective for separating single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural-network-supported multi-channel source separation with a time-domain training objective. As the objective, we propose a loss based on the convolutive transfer function invariant Signal-to-Distortion Ratio (CI-SDR). While this is a well-known evaluation metric (BSS Eval), it has not been used as a training objective before. To show its effectiveness, we evaluate on LibriSpeech-based reverberant mixtures. On this task, the proposed system approaches the error rate obtained on single-source non-reverberant input, i.e., LibriSpeech test_clean, with a difference of only 1.2 percentage points, thus outperforming a conventional permutation invariant training based system and alternative objectives such as the Scale-Invariant Signal-to-Distortion Ratio by a large margin.
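To illustrate the difference between the two objectives named above, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the BSS Eval style definition, in which the estimate is projected onto delayed copies of the reference, so that any short FIR filtering of the reference counts as target rather than distortion. The function names `si_sdr` and `ci_sdr`, the explicit least-squares solve, and the default of 512 filter taps are assumptions made for this example. A training loss would typically be the negative of this value, computed in a differentiable framework.

```python
import numpy as np


def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB: only a scalar gain on the reference is allowed."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    distortion = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))


def ci_sdr(reference, estimate, filter_length=512):
    """Convolutive transfer function invariant SDR in dB (illustrative sketch).

    The reference may be filtered by any FIR filter with `filter_length`
    taps before it is compared with the estimate. The 512-tap default is
    an assumption made for this example.
    """
    n = len(reference)
    # Column k holds the reference delayed by k samples.
    shifted = np.zeros((n, filter_length))
    for k in range(filter_length):
        shifted[k:, k] = reference[: n - k]
    # Least-squares FIR filter that maps the reference onto the estimate.
    taps, *_ = np.linalg.lstsq(shifted, estimate, rcond=None)
    target = shifted @ taps
    distortion = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```

With `filter_length=1` the projection reduces to a single gain, so `ci_sdr` coincides with `si_sdr`. For an estimate that is a short-filtered copy of the reference, `si_sdr` treats the filtering as distortion while `ci_sdr` does not.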