This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel WaveNet (PW). Specifically, we propose to extend the affine transformation of PW to a more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model across different speaking styles on a multi-speaker, multi-lingual dataset. In the waveform reconstruction task, the proposed model closes the naturalness and signal quality gap between the original PW and recordings by $10\%$, and the gap between other state-of-the-art neural vocoding systems and recordings by more than $60\%$. We also demonstrate improvements in objective metrics on the evaluation test set, with L2 Spectral Distance and Cross-Entropy reduced by $3\%$ and $6\unicode{x2030}$, respectively, compared to the affine PW. Furthermore, we extend the probability density distillation procedure proposed in the original PW paper so that it works with any non-affine invertible and differentiable function.
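The exact non-affine parameterisation is defined in the body of the paper; as a rough, assumed illustration of the idea stated in the abstract, the sketch below contrasts the elementwise affine flow step used by PW with a simple monotonic (and therefore invertible) non-affine step. The function names and the tanh-based parameterisation are illustrative assumptions, not the paper's method.

```python
# Minimal sketch: affine vs. non-affine invertible flow steps (assumed illustration).
import torch
import torch.nn.functional as F

def affine_step(z, mu, log_sigma):
    """Affine transform x = mu + sigma * z; invertible whenever sigma > 0."""
    x = mu + torch.exp(log_sigma) * z
    log_det = log_sigma                      # dx/dz = sigma, elementwise
    return x, log_det

def nonaffine_step(z, a_raw, b_raw, c):
    """Non-affine transform x = z + a * tanh(b * z + c) (hypothetical form).

    With a, b >= 0 (enforced via softplus) the derivative
    1 + a * b * (1 - tanh(b*z + c)**2) is >= 1, so the map is strictly
    increasing and hence invertible (inverse recoverable e.g. by bisection).
    """
    a = F.softplus(a_raw)
    b = F.softplus(b_raw)
    t = torch.tanh(b * z + c)
    x = z + a * t
    log_det = torch.log1p(a * b * (1.0 - t * t))
    return x, log_det

if __name__ == "__main__":
    z = torch.randn(4, 16000)                # toy latent "noise" waveform
    params = [torch.randn_like(z) for _ in range(3)]
    x_aff, ld_aff = affine_step(z, params[0], params[1])
    x_na, ld_na = nonaffine_step(z, *params)
    print(x_aff.shape, ld_aff.shape, x_na.shape, ld_na.shape)
```

In both cases the log-determinant of the Jacobian remains cheap to compute elementwise, which is what keeps such a non-affine step usable inside an NF vocoder and its distillation objective.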