In recent years, deep-learning-based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, and the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to filling this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of evaluating separation performance on real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures: its performance predictions correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models.
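For context, the standard SI-SNR is computed against a clean reference signal, which is why it cannot be applied directly to real recordings. The following is a minimal sketch of that reference-based definition in plain Python. It is not the blind neural estimator proposed here, and the function name `si_snr` and the `eps` stabilizer are illustrative assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Reference-based SI-SNR in dB between two equal-length signals.

    Illustrative sketch of the usual definition; `eps` is an assumed
    small constant that avoids division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so that DC offsets do not affect the measure.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference: this is the "target" part.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r) + eps
    scale = dot / ref_energy
    target = [scale * b for b in r]

    # Whatever is left after the projection is treated as noise.
    noise = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged. The `reference` argument is exactly what is missing for real-life mixtures, which motivates predicting the score blindly from the mixture and the separated outputs.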