Because adequate paired noisy-clean speech corpora are lacking in many real scenarios, non-parallel training is a promising direction for DNN-based speech enhancement. However, because of the severe mismatch between input and target speech, many previous studies estimate only the magnitude spectrum and leave the phase unaltered, which degrades speech quality under low signal-to-noise ratio conditions. To tackle this problem, we decouple the difficult task of optimizing the original complex spectrum into spectral magnitude estimation and phase estimation, and we propose a novel Cycle-in-Cycle generative adversarial network (dubbed CinCGAN) that jointly estimates spectral magnitude and phase stage by stage from unpaired data. In the first stage, we pretrain a magnitude CycleGAN to coarsely estimate the spectral magnitude of clean speech. In the second stage, we embed the pretrained CycleGAN in a complex-valued CycleGAN, forming a cycle-in-cycle structure that simultaneously recovers phase information and refines the overall spectrum. Experimental results demonstrate that the proposed approach significantly outperforms previous baselines under non-parallel training. An evaluation in which the models are trained on standard paired data also shows that CinCGAN achieves remarkable performance, especially in reducing background noise and speech distortion.
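The motivation for estimating phase as well as magnitude can be illustrated with a minimal sketch. It is not the CinCGAN implementation: the three spectrum bins are made-up toy values, the helper names (`split_mag_phase`, `combine`, `l1_distance`) are hypothetical, and a perfect ("oracle") magnitude estimate stands in for the output of a trained network. The sketch only shows that reusing the noisy phase leaves a residual error even when the magnitude is exact.

```python
import cmath

def split_mag_phase(spec):
    """Decompose complex spectrum bins into (magnitudes, phases)."""
    return [abs(z) for z in spec], [cmath.phase(z) for z in spec]

def combine(mags, phases):
    """Rebuild complex bins from magnitudes and phases."""
    return [cmath.rect(m, p) for m, p in zip(mags, phases)]

def l1_distance(a, b):
    """Mean absolute difference between two complex spectra."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy values (assumed for illustration only, not taken from the paper).
clean = [complex(1.0, 0.0), complex(0.0, 2.0), complex(-1.5, 0.5)]
noise = [complex(0.3, -0.4), complex(-0.5, 0.2), complex(0.4, 0.6)]
noisy = [c + n for c, n in zip(clean, noise)]

clean_mag, clean_phase = split_mag_phase(clean)
noisy_mag, noisy_phase = split_mag_phase(noisy)

# Magnitude-only enhancement: oracle magnitude, noisy phase left unaltered.
mag_only = combine(clean_mag, noisy_phase)
err_mag_only = l1_distance(mag_only, clean)

# Magnitude plus phase recovery: oracle magnitude and oracle phase.
full = combine(clean_mag, clean_phase)
err_full = l1_distance(full, clean)

print(f"magnitude-only error: {err_mag_only:.4f}")
print(f"magnitude+phase error: {err_full:.2e}")
```

In this toy setting the magnitude-only reconstruction keeps a nonzero error that comes entirely from the phase mismatch, which is the gap the second, complex-valued stage is described as addressing.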