Because adequate paired noisy-clean speech corpora are unavailable in many real scenarios, non-parallel training is a promising direction for DNN-based speech enhancement. However, because of the severe mismatch between input and target speech, many previous studies estimate only the magnitude spectrum and leave the phase unaltered, which degrades speech quality under low signal-to-noise ratio conditions. To tackle this problem, we decouple the difficult optimization target, the original complex spectrum, into spectral magnitude and phase, and propose a novel Cycle-in-Cycle generative adversarial network (dubbed CinCGAN) that jointly estimates spectral magnitude and phase stage by stage from unpaired data. In the first stage, we pretrain a magnitude CycleGAN to coarsely estimate the spectral magnitude of clean speech. In the second stage, we combine the pretrained CycleGAN with a complex-valued CycleGAN in a cycle-in-cycle structure to simultaneously recover phase information and refine the overall spectrum. Experimental results demonstrate that the proposed approach significantly outperforms previous baselines under non-parallel training. An evaluation with models trained on standard paired data also shows that CinCGAN performs strongly, especially in reducing background noise and speech distortion.
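The two-stage data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `stage1_magnitude` and `stage2_complex` are hypothetical placeholders standing in for the pretrained magnitude CycleGAN generator and the complex-valued generator, and pairing the coarse magnitude with the noisy phase before the second stage is an assumption made here for illustration.

```python
import numpy as np


def split_mag_phase(spec):
    """Decouple a complex spectrum into magnitude and phase."""
    return np.abs(spec), np.angle(spec)


def merge_mag_phase(mag, phase):
    """Recombine magnitude and phase into a complex spectrum."""
    return mag * np.exp(1j * phase)


def stage1_magnitude(noisy_mag):
    # Hypothetical placeholder for the pretrained magnitude generator.
    # A fixed attenuation stands in for a learned mapping.
    return 0.5 * noisy_mag


def stage2_complex(coarse_spec):
    # Hypothetical placeholder for the complex-valued generator.
    # A trained model would adjust real and imaginary parts (and thus phase);
    # this stand-in returns its input unchanged.
    return coarse_spec


def enhance(noisy_spec):
    noisy_mag, noisy_phase = split_mag_phase(noisy_spec)
    # Stage 1: coarse estimate of the clean magnitude.
    coarse_mag = stage1_magnitude(noisy_mag)
    # Assumption: the coarse magnitude is paired with the noisy phase.
    coarse_spec = merge_mag_phase(coarse_mag, noisy_phase)
    # Stage 2: refine the full complex spectrum, including phase.
    return stage2_complex(coarse_spec)


# Random complex array standing in for an STFT (frequency bins x frames).
rng = np.random.default_rng(0)
noisy_spec = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
mag, phase = split_mag_phase(noisy_spec)
enhanced = enhance(noisy_spec)
```

With real models in place of the placeholders, only the second stage would change the phase; the first stage operates on magnitude alone.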