The SepFormer architecture shows very good results in speech separation. Like other models with a learned encoder, it operates on short frames, which have been shown to yield better performance in this setting. This results in a large number of frames at the input, which is problematic: since the SepFormer is transformer-based, its computational complexity grows rapidly with sequence length. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
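To make the frame-count argument concrete, the sketch below contrasts a short-frame learned encoder with a long-frame magnitude STFT front end for a 10-second utterance. All concrete values (sampling rate, kernel size, stride, FFT size, hop length) are illustrative assumptions rather than the paper's exact configuration; the point is only that the STFT front end produces a much shorter frame sequence for the transformer to process.

```python
import torch

# Assumed settings for illustration: 10 s of audio at 8 kHz.
sample_rate = 8000
num_samples = 10 * sample_rate
signal = torch.randn(1, num_samples)

# Learned-encoder front end with short frames (kernel/stride are illustrative):
# a 1-D convolution emits one feature vector per hop of 'enc_stride' samples.
enc_kernel, enc_stride = 16, 8
encoder = torch.nn.Conv1d(1, 256, kernel_size=enc_kernel, stride=enc_stride)
learned_frames = encoder(signal.unsqueeze(1)).shape[-1]

# Magnitude STFT front end with long frames (window/hop are also illustrative).
n_fft, hop = 512, 256
window = torch.hann_window(n_fft)
spec = torch.stft(signal, n_fft=n_fft, hop_length=hop,
                  window=window, return_complex=True)
stft_frames = spec.abs().shape[-1]  # magnitude spectrogram: (batch, freq, frames)

print(f"learned-encoder frames: {learned_frames}")
print(f"magnitude-STFT frames:  {stft_frames}")
# Self-attention cost grows quadratically with sequence length, so shortening
# the frame sequence by this ratio reduces the full-attention cost roughly by
# the square of the ratio; the overall model saving is smaller, since attention
# is only part of the total computation.
print(f"frame-count ratio: {learned_frames / stft_frames:.1f}x")
```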