The intelligibility of speech degrades severely in the presence of environmental noise and reverberation. In this paper, we propose a novel deep-learning-based system that modifies the speech signal to increase its intelligibility under the equal-power constraint, i.e., the signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency-dependent amplification factors, which are then applied to the raw input speech to reallocate the speech energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, including both intelligibility- and quality-related ones, which yields notable improvements in performance and robustness. Our system can not only operate in non-real-time mode for offline audio playback but also support practical real-time speech applications. Experimental results from both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.
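To make the equal-power constraint concrete, the following is a minimal sketch of one way time-frequency gains could be applied while keeping total energy unchanged. It assumes a magnitude spectrogram from an energy-preserving (orthonormal) transform, so energy summed over the time-frequency plane equals signal power. The gains are taken as given inputs rather than predicted by a GAN-trained DNN, and the single global rescaling step is an illustrative assumption, not necessarily the method used in the paper. The function name `reallocate_energy` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def reallocate_energy(spec: Matrix, gains: Matrix) -> Matrix:
    """Apply per-bin gains, then rescale so total energy is unchanged.

    Hypothetical helper (illustrative sketch only).
    spec:  magnitude spectrogram |X(t, f)|; rows are frames, columns are bins.
    gains: non-negative amplification factors G(t, f), same shape as spec.
    Assumes an energy-preserving transform, so summed |X|^2 equals signal power.
    """
    if len(spec) != len(gains) or any(len(a) != len(b) for a, b in zip(spec, gains)):
        raise ValueError("spec and gains must have the same shape")

    # Energy of the unmodified signal in the time-frequency domain.
    energy_in = sum(x * x for row in spec for x in row)

    # Apply the time-frequency dependent gains bin by bin.
    boosted = [[g * x for g, x in zip(grow, xrow)] for grow, xrow in zip(gains, spec)]
    energy_out = sum(y * y for row in boosted for y in row)

    if energy_out == 0.0:
        # Degenerate case (silent input or all-zero gains): return a copy of the input.
        return [row[:] for row in spec]

    # Global scale factor alpha such that sum |alpha * G * X|^2 == sum |X|^2.
    alpha = math.sqrt(energy_in / energy_out)
    return [[alpha * y for y in row] for row in boosted]


if __name__ == "__main__":
    spec = [[1.0, 2.0], [0.5, 0.0]]
    gains = [[0.5, 2.0], [1.0, 3.0]]
    out = reallocate_energy(spec, gains)
    print(sum(y * y for row in out for y in row))  # approximately 5.25, same as the input energy
```

Because only one global factor is applied after the gains, the relative energy distribution chosen by the gains (for example, boosting the second bin of the first frame relative to the first) is preserved, while the overall power stays fixed.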