Although deep learning has driven rapid progress in speech enhancement (SE), most approaches pursue performance in a black-box manner and lack adequate model interpretability. Inspired by Taylor's approximation theory, we propose an interpretable decoupling-style SE framework that disentangles complex spectrum recovery into two separate optimization problems, i.e., magnitude estimation and complex residual estimation. Specifically, a filter network serves as the 0th-order term of the Taylor series; it suppresses the noise component in the magnitude domain only and yields a coarse spectrum. To refine the phase distribution, we estimate the sparse complex residual, defined as the difference between the target and coarse spectra, which measures the phase gap. In this study, we formulate the residual component as a combination of high-order Taylor terms and propose a lightweight trainable module to replace the complicated derivative operator between adjacent terms. Finally, following Taylor's formula, we reconstruct the target spectrum by superimposing the 0th-order and high-order terms. Experimental results on two benchmark datasets show that our framework achieves state-of-the-art performance over previous competing baselines across various evaluation metrics. The source code is available at github.com/Andong-Lispeech/TaylorSENet.
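The decoupling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released TaylorSENet code: the function names, the real-valued `gain` mask standing in for the filter network, and the per-order complex `weights` standing in for the trainable derivative surrogate are all hypothetical, and the choice to feed each surrogate only the previous term and to scale by 1/q! is an assumption made to mirror a textbook Taylor expansion.

```python
import numpy as np


def zeroth_order_term(noisy_spec, gain):
    """Scale only the magnitude of the noisy spectrum; keep the noisy phase.

    `gain` is a hypothetical real, non-negative mask standing in for the
    output of the filter network.
    """
    magnitude = np.abs(noisy_spec)
    phase = np.angle(noisy_spec)
    return gain * magnitude * np.exp(1j * phase)


def surrogate_derivative(term, weight):
    """Hypothetical stand-in for the lightweight trainable module.

    A single complex weight replaces the learned mapping between
    adjacent Taylor terms (an assumption made for illustration only).
    """
    return weight * term


def taylor_reconstruct(noisy_spec, gain, weights):
    """Return (coarse spectrum, final estimate).

    estimate = 0th-order term + sum over q of (q-th term / q!)
    """
    coarse = zeroth_order_term(noisy_spec, gain)
    estimate = coarse.copy()
    term = coarse
    factorial = 1.0
    for order, weight in enumerate(weights, start=1):
        term = surrogate_derivative(term, weight)  # next high-order term
        factorial *= order
        estimate = estimate + term / factorial
    return coarse, estimate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
    gain = rng.uniform(0.0, 1.0, size=(4, 3))
    coarse, estimate = taylor_reconstruct(noisy, gain, [0.1 + 0.05j, 0.02 - 0.01j])
    # The complex residual is whatever the high-order terms add to the coarse spectrum.
    print(np.abs(estimate - coarse).max())
```

In this sketch the 0th-order term changes only the magnitude, so any phase correction has to come from the complex-valued high-order terms, which is the division of labor the abstract describes.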