This work adapts two recent architectures of generative models and evaluates their effectiveness for the conversion of whispered speech to normal speech. We incorporate the normal target speech into the training criterion of vector-quantized variational autoencoders (VQ-VAEs) and MelGANs, thereby conditioning the systems to recover voiced speech from whispered inputs. Objective and subjective quality measures indicate that both VQ-VAEs and MelGANs can be modified to perform the conversion task. We find that the proposed approaches significantly improve the Mel cepstral distortion (MCD) metric by at least 25% relative to a DiscoGAN baseline. Subjective listening tests suggest that the MelGAN-based system significantly improves naturalness, intelligibility, and voicing compared to the whispered input speech. A novel evaluation measure based on differences between latent speech representations also indicates that our MelGAN-based approach yields improvements relative to the baseline.
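As a rough illustration of the objective metric mentioned above, the following sketch computes a commonly used form of Mel cepstral distortion, MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames, together with the relative improvement over a baseline score. The abstract does not give the exact configuration, so several details here are assumptions: the frames are taken to be already time-aligned (for example by dynamic time warping), the zeroth (energy) coefficient is excluded, and the function names `mel_cepstral_distortion` and `relative_improvement` are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Each argument is a list of frames; each frame is a list of coefficients
    [c0, c1, ..., cD]. Assumption: the two sequences are already aligned and
    c0 (energy) is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("inputs must be non-empty with the same number of frames")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        if len(ref) != len(conv):
            raise ValueError("frames must have the same number of coefficients")
        # Sum of squared differences over coefficients 1..D (index 0 skipped).
        squared_diff = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


def relative_improvement(baseline_mcd, system_mcd):
    """Fractional MCD reduction relative to a baseline (lower MCD is better)."""
    return (baseline_mcd - system_mcd) / baseline_mcd
```

Under these assumptions, a system whose MCD is 6.0 dB against a baseline of 8.0 dB gives `relative_improvement(8.0, 6.0) == 0.25`, that is, a 25% relative improvement. These numbers are purely illustrative and are not results from the paper.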