In this paper, we propose a transformer-based architecture, called the two-stage transformer neural network (TSTNN), for end-to-end speech denoising in the time domain. The proposed model is composed of an encoder, a two-stage transformer module (TSTM), a masking module and a decoder. The encoder maps the input noisy speech into a feature representation. The TSTM exploits four stacked two-stage transformer blocks to efficiently extract local and global information from the encoder output stage by stage. The masking module creates a mask that is multiplied with the encoder output. Finally, the decoder uses the masked encoder features to reconstruct the enhanced speech. Experimental results on the benchmark dataset show that TSTNN outperforms most state-of-the-art models in the time or frequency domain while having significantly lower model complexity.
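As a rough illustration of the pipeline described above, the following is a minimal PyTorch sketch of the encoder / TSTM / masking / decoder flow. Everything concrete here is an assumption for illustration: the convolutional encoder and decoder, the use of stock `nn.TransformerEncoderLayer` as a stand-in for the paper's transformer blocks, and all kernel sizes, channel counts, and the segment length. The sketch only shows how four stacked two-stage blocks alternate local (within-segment) and global (across-segment) attention, and how the predicted mask gates the encoder output before decoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageTransformerBlock(nn.Module):
    """One two-stage block: a local transformer attends within each short
    segment, then a global transformer attends across segments. The stock
    TransformerEncoderLayer is a stand-in for the paper's blocks."""
    def __init__(self, dim, nhead=4):
        super().__init__()
        self.local_tf = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.global_tf = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)

    def forward(self, x):
        # x: (batch, n_segments, segment_len, dim)
        b, s, k, d = x.shape
        # Stage 1: local attention inside each segment.
        x = self.local_tf(x.reshape(b * s, k, d)).reshape(b, s, k, d)
        # Stage 2: global attention across segments at each intra-segment position.
        x = self.global_tf(x.transpose(1, 2).reshape(b * k, s, d))
        return x.reshape(b, k, s, d).transpose(1, 2)

class TSTNN(nn.Module):
    """Encoder -> four stacked two-stage blocks -> masking -> decoder.
    All hyperparameters below are illustrative assumptions."""
    def __init__(self, dim=64, n_blocks=4, segment_len=32):
        super().__init__()
        self.segment_len = segment_len
        self.encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8)        # waveform -> features
        self.tstm = nn.ModuleList([TwoStageTransformerBlock(dim) for _ in range(n_blocks)])
        self.masking = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)

    def forward(self, noisy):
        feat = self.encoder(noisy)                      # (batch, dim, frames)
        b, d, t = feat.shape
        x = F.pad(feat, (0, (-t) % self.segment_len))   # pad so frames split evenly
        x = x.transpose(1, 2).reshape(b, -1, self.segment_len, d)
        for block in self.tstm:                         # stage-by-stage refinement
            x = block(x)
        x = x.reshape(b, -1, d).transpose(1, 2)[:, :, :t]
        mask = self.masking(x)                          # mask values in [0, 1]
        return self.decoder(feat * mask)                # enhanced waveform

noisy = torch.randn(2, 1, 16000)   # two 1-second clips at 16 kHz
print(TSTNN()(noisy).shape)        # torch.Size([2, 1, 16000])
```

Note that the segment reshape is what makes the two stages complementary: the local transformer sees fine temporal detail within each segment, while the global transformer, operating across segments at each intra-segment position, captures long-range context at much lower cost than full-sequence attention.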