Because adequate paired noisy-clean speech corpora are lacking in many real scenarios, non-parallel training is a promising direction for DNN-based speech enhancement methods. However, because of the severe mismatch between input and target speech, many previous studies focus only on magnitude spectrum estimation and leave the phase unaltered, which degrades speech quality under low signal-to-noise ratio conditions. To tackle this problem, we decouple the difficult target of optimizing the original spectrum into spectral magnitude and phase, and propose a novel cycle-in-cycle generative adversarial network (dubbed CinCGAN) to jointly estimate the spectral magnitude and phase information stage by stage. In the first stage, we pretrain a magnitude CycleGAN to coarsely denoise the magnitude spectrum. In the second stage, we couple the pretrained CycleGAN with a complex-valued CycleGAN in a cycle-in-cycle structure to recover phase information and refine the spectral magnitude simultaneously. Experimental results on the VoiceBank + DEMAND dataset show that the proposed approach significantly outperforms previous baselines under non-parallel training. Experiments on training the models with standard paired data also show that the proposed method can achieve remarkable performance.
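The decoupling of a complex spectrum into magnitude and phase, and the cost of leaving the phase unaltered, can be illustrated with a minimal NumPy sketch. This is not the CinCGAN model: the test signal, the noise level, the single-frame FFT, and the helper names `split_mag_phase` and `combine` are all assumptions made for illustration. It uses an oracle clean magnitude to show that even a perfect magnitude estimate leaves residual error when it is paired with the noisy phase.

```python
import numpy as np

def split_mag_phase(spec):
    """Decouple a complex spectrum into magnitude and phase (hypothetical helper)."""
    return np.abs(spec), np.angle(spec)

def combine(mag, phase):
    """Rebuild a complex spectrum from magnitude and phase (hypothetical helper)."""
    return mag * np.exp(1j * phase)

# Toy single-frame signals (assumed values, not from the paper).
rng = np.random.default_rng(0)
n = 512
clean = np.sin(2 * np.pi * 440 * np.arange(n) / 16000)
noisy = clean + 0.5 * rng.standard_normal(n)

clean_spec = np.fft.rfft(clean)
noisy_spec = np.fft.rfft(noisy)

clean_mag, clean_phase = split_mag_phase(clean_spec)
noisy_mag, noisy_phase = split_mag_phase(noisy_spec)

# Magnitude-only enhancement: oracle clean magnitude, phase left unaltered (noisy).
mag_only = combine(clean_mag, noisy_phase)
# Joint estimation: oracle magnitude and oracle phase.
joint = combine(clean_mag, clean_phase)

err_mag_only = np.linalg.norm(mag_only - clean_spec)
err_joint = np.linalg.norm(joint - clean_spec)
print(f"magnitude-only error: {err_mag_only:.3f}")
print(f"joint magnitude+phase error: {err_joint:.3e}")
```

The joint reconstruction matches the clean spectrum up to floating-point error, while the magnitude-only one does not. That gap is the part of the error a phase-aware second stage would aim to close.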