Most studies on speech enhancement do not account for the energy distribution of speech in the time-frequency (T-F) representation, which is important for accurately predicting a mask or spectrum. In this paper, we present a simple yet effective T-F attention (TFA) module that produces a 2-D attention map assigning differentiated weights to the spectral components of the T-F representation. To validate the effectiveness of the proposed TFA module, we use the residual temporal convolution network (ResTCN) as the backbone and conduct extensive experiments on two commonly used training targets. Our experiments demonstrate that applying the TFA module significantly improves performance on five objective evaluation metrics with negligible parameter overhead. The evaluation results show that the proposed ResTCN with the TFA module (ResTCN+TFA) consistently outperforms the other baselines by a large margin.
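To illustrate the core idea, the sketch below builds a 2-D attention map over a magnitude spectrogram and reweights each T-F unit. This is a simplified, energy-based NumPy illustration under our own assumptions (the function name `tf_attention` and the use of per-frame and per-bin energy statistics are hypothetical); the paper's TFA module is learned from data, not computed in closed form.

```python
import numpy as np

def tf_attention(spec):
    """Simplified sketch of a 2-D time-frequency attention map.

    spec: magnitude spectrogram of shape (T, F), where T is the number
    of frames and F the number of frequency bins. Returns the spectrogram
    reweighted by an attention map in (0, 1). Weights here are derived
    directly from the energy distribution for illustration only.
    """
    # Time attention: mean energy per frame, squashed to (0, 1) by a sigmoid.
    t_att = 1.0 / (1.0 + np.exp(-spec.mean(axis=1, keepdims=True)))  # (T, 1)
    # Frequency attention: mean energy per bin, squashed the same way.
    f_att = 1.0 / (1.0 + np.exp(-spec.mean(axis=0, keepdims=True)))  # (1, F)
    # Broadcasted product -> one weight per T-F unit.
    att = t_att * f_att  # (T, F)
    return spec * att

# Example: reweight a random 100-frame, 257-bin spectrogram.
spec = np.abs(np.random.randn(100, 257)).astype(np.float32)
out = tf_attention(spec)
```

Because each weight lies in (0, 1), the module rescales rather than amplifies spectral components; a learned version would instead let the network decide which T-F regions to emphasize.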