Driven by breakthroughs in deep learning, speech enhancement (SE) techniques have developed rapidly and play an important role as a front end to acoustic modeling, mitigating the effects of noise on speech. To increase the perceptual quality of speech, the current state of the art in SE adopts adversarial training by connecting an objective metric to the discriminator. However, there is no guarantee that optimizing perceptual quality will lead to improved automatic speech recognition (ASR) performance. In this study, we present TENET, a novel Time-reversal Enhancement NETwork, which leverages a transformation of the noisy input signal itself, namely its time-reversed version, in conjunction with a siamese network and a complex dual-path transformer to improve SE performance for noise-robust ASR. Extensive experiments on the Voicebank-DEMAND dataset show that TENET achieves state-of-the-art results compared with a few top-of-the-line methods in terms of both SE and ASR evaluation metrics. To demonstrate the model's generalization ability, we further evaluate TENET on test scenarios contaminated with unseen noise, and the results also confirm the superiority of this promising method.
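The time-reversal idea can be illustrated with a minimal sketch. It assumes a single-channel waveform stored as a NumPy array and uses a hypothetical placeholder `toy_enhance` in place of the enhancement network. The names `time_reverse` and `siamese_forward` are illustrative, and flipping the reversed branch back before comparison is an assumption, since the abstract does not state how the two branches are combined.

```python
import numpy as np


def time_reverse(signal: np.ndarray) -> np.ndarray:
    """Return the waveform with its sample order flipped along the last axis."""
    return signal[..., ::-1].copy()


def siamese_forward(enhance, noisy: np.ndarray):
    """Run one shared enhancement function on the input and on its time-reversed view.

    Assumption: the reversed branch is flipped back so that both outputs are
    aligned in time; the actual TENET pipeline may differ.
    """
    forward_out = enhance(noisy)
    reversed_out = time_reverse(enhance(time_reverse(noisy)))
    return forward_out, reversed_out


def toy_enhance(x: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the enhancement network (simple attenuation)."""
    return 0.5 * x


rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)  # one second of synthetic audio at 16 kHz
fwd, rev = siamese_forward(toy_enhance, noisy)
```

Because the same `enhance` callable serves both branches, the two views share parameters, which is the defining property of a siamese arrangement.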