State-of-the-art speech enhancement methods still offer limited speech estimation accuracy. Recently, in deep learning, the Transformer has shown the potential to exploit long-range dependencies in speech through self-attention, and it has therefore been introduced into speech enhancement to improve the accuracy of speech estimation from a noisy mixture. However, self-attention incurs a high computational cost; axial attention is one option for reducing it, splitting a 2D attention into two 1D attentions. Inspired by axial attention, the proposed method computes attention maps along both the time and frequency axes to generate time and frequency sub-attention maps. Unlike axial attention, however, the proposed method applies two multi-head attentions to the time and frequency axes in parallel. Furthermore, the literature shows that, in a noisy mixture, the lower frequency band of speech generally carries more of the desired information than the higher frequency band. Frequency-band aware attention is therefore proposed, consisting of high frequency-band attention (HFA) and low frequency-band attention (LFA). A U-shaped Transformer is also introduced, for the first time, in the proposed method to further improve speech estimation accuracy. Extensive evaluations on four public datasets confirm the efficacy of the proposed method.
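To make the parallel time-/frequency-axis attention concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes an input feature map of shape (batch, time, frequency, channels), and the module name `TimeFreqAttention`, the residual summation used to fuse the two branches, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of two parallel multi-head attentions, one along the
# time axis and one along the frequency axis (cf. axial attention).
# This is an illustrative reconstruction, not the authors' code.
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Two independent multi-head attentions, applied in parallel.
        self.time_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, f, c = x.shape
        # Time-axis branch: each frequency bin becomes an independent
        # length-t sequence, yielding a time sub-attention map.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        xt, _ = self.time_attn(xt, xt, xt)
        xt = xt.reshape(b, f, t, c).permute(0, 2, 1, 3)
        # Frequency-axis branch: each time frame becomes an independent
        # length-f sequence, yielding a frequency sub-attention map.
        xf = x.reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        xf = xf.reshape(b, t, f, c)
        # Fuse the parallel branches by summation (one plausible choice).
        return x + xt + xf

# Usage: a batch of 2 spectrogram feature maps, 100 frames x 257 bins.
layer = TimeFreqAttention(channels=64)
feats = torch.randn(2, 100, 257, 64)
out = layer(feats)  # same shape as the input: (2, 100, 257, 64)
```

Because each 1D attention attends over only one axis, the cost is O(t·f·(t+f)) rather than the O((t·f)^2) of full 2D self-attention; the HFA/LFA idea would further split the frequency axis into low and high bands before the frequency-axis branch.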