The attention mechanism has been widely utilized in speech enhancement (SE) because, in principle, it can effectively model the inherent connections of a signal in both the time domain and the frequency domain. Typically, the span of attention is limited in the time domain, whereas attention in the frequency domain spans the entire frequency range. In this paper, we observe that attention over the whole frequency range hampers inference for full-band SE and can lead to excessive residual noise. To alleviate this problem, we introduce local spectral attention (LSA) into a full-band SE model by limiting the span of attention. An ablation study on a state-of-the-art (SOTA) full-band SE model shows that local frequency attention effectively improves overall performance. The improved model achieves the best objective scores on the full-band VoiceBank+DEMAND dataset.
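To make the idea of limiting the attention span concrete, the following is a minimal sketch of frequency-axis attention restricted to a local band around each query bin, implemented with a band mask on the attention logits. The function name, the `span` parameter, and the masking strategy are illustrative assumptions and are not taken from the paper's actual LSA formulation.

```python
import torch
import torch.nn.functional as F

def local_spectral_attention(q, k, v, span):
    """Scaled dot-product attention over frequency bins, restricted to a
    local band of +/- `span` bins around each query bin.

    q, k, v: (batch, n_freq, d) tensors. `span` (hypothetical parameter)
    is the half-width of the allowed attention window.
    """
    n_freq, d = q.shape[-2], q.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, F, F)

    # Band mask: query bin i may only attend to key bins j with |i - j| <= span.
    idx = torch.arange(n_freq, device=q.device)
    allowed = (idx[None, :] - idx[:, None]).abs() <= span     # (F, F) boolean
    scores = scores.masked_fill(~allowed, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Usage: 257 frequency bins (e.g. a 512-point STFT), local window of +/- 16 bins.
q = k = v = torch.randn(2, 257, 64)
out = local_spectral_attention(q, k, v, span=16)
print(out.shape)  # torch.Size([2, 257, 64])
```

With full-range attention, `allowed` would be all ones; limiting it to a narrow band is the restriction the abstract refers to.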