Impressive progress has been made in recent years in neural network-based single-channel speech source separation. However, those improvements have mostly been reported on anechoic data, a condition that is rarely met in practice. Taking the SepFormer, which achieves state-of-the-art performance on anechoic mixtures, as a starting point, we gradually modify it to optimize its performance on reverberant mixtures. Although this improves the word error rate by 8 percentage points compared to the standard SepFormer implementation, the resulting system performs only marginally better than our improved PIT-BLSTM separation system, which is optimized by rather straightforward means. This is surprising and at the same time sobering, and it calls into question the practical usefulness of many improvements reported in recent years for monaural source separation on nonreverberant data.