Deep learning-based speech enhancement has improved substantially and has recently been extended to full-band audio (48 kHz). However, many approaches have high computational complexity and require large temporal buffers for real-time use, e.g. due to temporal convolutions or attention. Both properties make these approaches infeasible on embedded devices. This work further extends DeepFilterNet, which exploits the harmonic structure of speech to enable efficient speech enhancement (SE). Several optimizations in the training procedure, data augmentation, and network structure result in state-of-the-art SE performance while reducing the real-time factor to 0.04 on a notebook Core-i5 CPU. This makes the algorithm suitable for real-time use on embedded devices. The DeepFilterNet framework is available under an open-source license.
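
To make the reported real-time factor concrete, the following minimal sketch assumes the common definition of the real-time factor as processing time divided by the duration of the processed audio. The function names `real_time_factor` and `measure_rtf` are hypothetical and are not part of the DeepFilterNet framework; `process` stands in for any enhancement function.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process, samples, sample_rate: int) -> float:
    """Time one call of `process` on `samples` and return the resulting RTF.

    `process` is a placeholder for an enhancement function; it is an
    assumption of this sketch, not an API of any particular library.
    """
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # Under this definition, an RTF of 0.04 corresponds to spending about
    # 0.4 s of compute on 10 s of audio.
    print(real_time_factor(0.4, 10.0))

    # Timing a trivial pass-through on one second of 48 kHz audio.
    one_second = [0.0] * 48000
    print(measure_rtf(lambda x: x, one_second, 48000))
```

Under this definition, any value below 1 means the audio is processed faster than it is played back. The measured value depends on the hardware and on how the timing is taken.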