Impressive progress has been made in recent years in neural network-based single-channel speech source separation. However, those improvements have mostly been reported on anechoic data, a condition that is rarely met in practice. Taking the SepFormer, which achieves state-of-the-art performance on anechoic mixtures, as a starting point, we gradually modify it to optimize its performance on reverberant mixtures. Although this improves the word error rate by 7 percentage points compared to the standard SepFormer implementation, the resulting system performs only marginally better than a PIT-BLSTM separation system that is optimized with rather straightforward means. This is surprising and at the same time sobering, and it challenges the practical usefulness of many improvements reported in recent years for monaural source separation on nonreverberant data.