In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. The baseline can be extended by appending lightweight black-box postnets that further process its output to improve fidelity. An alternative differentiable approach extracts the source excitation spectrum directly, which can improve naturalness, albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit of naturally disentangling pitch and timbral information, so that the two can be modeled separately. Moreover, because these acoustic features can be estimated robustly from monophonic audio sources, parameter loss terms can be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training.
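As a rough illustration of how parameter loss terms might enter an end-to-end objective, the sketch below adds a weighted distance between predicted acoustic features and reference features (for example, features re-estimated from the target audio) to an audio-domain loss. It is not the formulation used in the paper: the feature keys `f0`, `sp`, and `ap`, the mean absolute error distance, the per-feature weights, and the `param_scale` factor are all assumptions chosen for illustration, and plain Python floats stand in for the tensors an autodiff framework would use.

```python
from typing import Dict, Sequence

# Hypothetical container: feature name -> flattened per-frame values.
Features = Dict[str, Sequence[float]]


def mean_abs_error(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute error between two equally long sequences."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def parameter_loss(pred: Features, ref: Features,
                   weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature errors (assumed form, not the paper's)."""
    return sum(w * mean_abs_error(pred[name], ref[name])
               for name, w in weights.items())


def total_objective(audio_loss: float, pred: Features, ref: Features,
                    weights: Dict[str, float],
                    param_scale: float = 1.0) -> float:
    """Audio-domain loss plus a scaled parameter loss term."""
    return audio_loss + param_scale * parameter_loss(pred, ref, weights)
```

In practice the same arithmetic would be written with differentiable tensor operations so that gradients flow back through the predicted features; the weights control how strongly each feature constrains training relative to the audio loss.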