With recent research advancements, deep learning models are becoming attractive and powerful choices for speech enhancement in real-time applications. While state-of-the-art models can achieve outstanding results in terms of speech quality and background noise reduction, the main challenge is to obtain models that are compact enough to be resource-efficient at inference time. An important but often neglected aspect of data-driven methods is that results are only convincing when tested on real-world data and evaluated with meaningful metrics. In this work, we investigate reasonably small recurrent and convolutional-recurrent network architectures for speech enhancement, trained on a large dataset that also includes reverberation. We show interesting tradeoffs between computational complexity and the achievable speech quality, measured on real recordings using a highly accurate MOS estimator. The results indicate that the achievable speech quality is a function of network complexity, and we show which models offer better tradeoffs.