Speech enhancement has recently achieved great success with various deep learning methods. However, most conventional speech enhancement systems are trained with supervised methods, which pose two significant challenges. First, the majority of training datasets for speech enhancement are synthetic: when clean speech and noise corpora are mixed to create them, a domain mismatch arises between the synthetic data and real-world recordings of noisy speech. Second, there is a trade-off between improving speech enhancement performance and degrading automatic speech recognition (ASR) performance. We therefore propose an unsupervised loss function to tackle these two problems. Our loss function extends the MixIT loss with a speech recognition embedding and a disentanglement loss. Our results show that the proposed function effectively improves speech enhancement performance compared to a baseline trained in a supervised way on the noisy VoxCeleb dataset. While fully unsupervised training is unable to exceed the corresponding supervised baseline, joint supervised and unsupervised training achieves similar speech quality and better ASR performance than the best supervised baseline.
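For readers unfamiliar with MixIT, the following is a minimal PyTorch sketch of the mixture invariant training objective that the proposed loss builds on: the model's input is the sum of two noisy mixtures, and each estimated source is assigned to one of the two mixtures so that the reconstruction loss is minimized. The function names (negative_snr, mixit_loss) and the negative-SNR reconstruction loss are illustrative assumptions, not the paper's implementation; the speech recognition embedding and disentanglement terms are omitted here.

```python
# Minimal sketch of a MixIT-style loss, assuming a PyTorch model that
# separates a sum of two mixtures into M estimated sources.
import itertools
import torch

def negative_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative signal-to-noise ratio in dB, per batch example.
    (Illustrative reconstruction loss; the actual paper may use another.)"""
    num = torch.sum(ref ** 2, dim=-1)
    den = torch.sum((ref - est) ** 2, dim=-1)
    return -10.0 * torch.log10(num / (den + eps) + eps)

def mixit_loss(sources: torch.Tensor, mix1: torch.Tensor, mix2: torch.Tensor) -> torch.Tensor:
    """MixIT objective: assign each of the M estimated sources to one of the
    two input mixtures and keep the assignment with the lowest loss.

    sources: (batch, M, time) model outputs
    mix1, mix2: (batch, time) the two mixtures whose sum was the model input
    """
    _, num_sources, _ = sources.shape
    best = None
    # Enumerate all 2^M binary assignments of sources to the two mixtures.
    for assign in itertools.product([0, 1], repeat=num_sources):
        mask = torch.tensor(assign, dtype=sources.dtype, device=sources.device)
        est1 = (sources * (1.0 - mask).view(1, -1, 1)).sum(dim=1)
        est2 = (sources * mask.view(1, -1, 1)).sum(dim=1)
        loss = negative_snr(est1, mix1) + negative_snr(est2, mix2)
        # Keep the per-example minimum over all assignments.
        best = loss if best is None else torch.minimum(best, loss)
    return best.mean()
```

Because the targets are themselves noisy mixtures rather than clean references, this objective requires no clean speech, which is what makes fully unsupervised and joint supervised-plus-unsupervised training on real recordings possible.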