Deep learning-based speech enhancement has shown unprecedented performance in recent years. The most popular mono speech enhancement frameworks are end-to-end networks that map the noisy mixture to an estimate of the clean speech. With growing computational power and the availability of multichannel microphone recordings, prior works have aimed to incorporate spatial statistics along with spectral information to boost performance. Despite improvements in the enhancement performance of the mono output, spatial image preservation and subjective evaluation have received little attention in the literature. This paper proposes a novel stereo-aware framework for speech enhancement, i.e., a training loss for deep learning-based speech enhancement that preserves the spatial image while enhancing the stereo mixture. The proposed framework is model-independent and can therefore be applied to any deep learning-based architecture. We provide an extensive objective and subjective evaluation of the trained models through a listening test. We show that by regularizing for an image preservation loss, the overall performance is improved, and the stereo aspect of the speech is better preserved.
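To make the idea of regularizing for image preservation concrete, the overall objective could take the form of a per-channel enhancement loss plus a weighted spatial term. The sketch below is a minimal, hypothetical illustration in NumPy: it is not the paper's actual loss, and the choice of MSE terms, the inter-channel difference signal as a proxy for the stereo image, and the weight `alpha` are all assumptions for illustration.

```python
import numpy as np

def stereo_aware_loss(est, ref, alpha=0.1):
    """Hypothetical stereo-aware training loss.

    est, ref: arrays of shape (2, T) holding the stereo estimate
    and the clean stereo reference, respectively.
    """
    # Base enhancement term: mean squared error over both channels.
    enh = np.mean((est - ref) ** 2)
    # Spatial-image regularizer (illustrative): penalize mismatch of the
    # inter-channel difference signal, a crude proxy for the stereo image.
    spatial = np.mean(((est[0] - est[1]) - (ref[0] - ref[1])) ** 2)
    # alpha trades off enhancement quality against image preservation.
    return enh + alpha * spatial
```

With `alpha = 0`, the objective reduces to the plain channel-wise enhancement loss; increasing `alpha` biases training toward preserving the inter-channel relationships that carry the spatial image.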