Most previous neural text-to-speech (TTS) methods are based on supervised learning, which means they depend on large training datasets and struggle to achieve comparable performance under low-resource conditions. To address this issue, we propose a semi-supervised learning method for neural TTS in which labeled target data are limited; it also alleviates the exposure-bias problem of previous autoregressive models. Specifically, we pre-train a reference model based on FastSpeech 2 on a large amount of source data and then fine-tune it on the limited target dataset. Meanwhile, pseudo labels generated by the original reference model are used to guide the fine-tuned model's training, providing a regularization effect and reducing overfitting during training on the limited target data. Experimental results show that our proposed semi-supervised learning scheme with limited target data significantly improves voice quality on test data, achieving naturalness and robustness in speech synthesis.
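The following is a minimal sketch of the pseudo-label-guided fine-tuning loop described above, under several assumptions not stated in the abstract: a PyTorch implementation, a `build_model()` stand-in for the FastSpeech 2 acoustic model, an L1 loss on mel-spectrograms, and a hypothetical weight `alpha` balancing the supervised and pseudo-label terms.

```python
# Sketch: fine-tuning with pseudo labels from a frozen, pre-trained reference model.
import copy
import torch
import torch.nn as nn

def build_model() -> nn.Module:
    # Stand-in for a FastSpeech 2 acoustic model: here just a dummy layer
    # mapping a 64-dim input representation to 80 mel bins.
    return nn.Linear(64, 80)

# Reference model: assumed pre-trained on abundant source data.
reference = build_model()

# Fine-tuned model: initialized from the reference, then updated on target data.
model = copy.deepcopy(reference)
model.train()

# Freeze the reference; it only produces pseudo labels from now on.
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
l1 = nn.L1Loss()
alpha = 0.5  # hypothetical weight for the pseudo-label (regularization) term

def fine_tune_step(inputs: torch.Tensor, target_mel: torch.Tensor) -> float:
    """One training step on a batch of limited target data."""
    with torch.no_grad():
        pseudo_mel = reference(inputs)  # pseudo labels from the frozen reference
    pred_mel = model(inputs)
    # Supervised term on the scarce ground truth, plus a pseudo-label term
    # that keeps the fine-tuned model close to the reference (regularization).
    loss = l1(pred_mel, target_mel) + alpha * l1(pred_mel, pseudo_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch illustrating the shapes assumed above.
x = torch.randn(8, 64)
y = torch.randn(8, 80)
print(fine_tune_step(x, y))
```

The key design choice, as the abstract describes it, is that the pseudo-label term penalizes the fine-tuned model for drifting far from the source-trained reference, which acts as regularization when the target dataset is too small to constrain training on its own.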