Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point for exploring the pruning of both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoff between sparsity and its effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: the amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation with pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, and with similar prosody. All of our experiments are conducted on publicly available models, and the findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.
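To make the pruning setup concrete, below is a minimal sketch of one-shot unstructured magnitude pruning using PyTorch's `torch.nn.utils.prune` utilities. This is an illustrative assumption rather than the paper's exact recipe: the choice of prunable layer types, the one-shot schedule, and the helper names (`magnitude_prune`, `report_sparsity`) are hypothetical.

```python
# A minimal sketch of one-shot unstructured magnitude pruning for a TTS
# network (spectrogram predictor or vocoder), assuming a PyTorch model.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def magnitude_prune(model: nn.Module, sparsity: float) -> nn.Module:
    """Zero out the smallest-magnitude weights in every Linear/Conv1d layer."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv1d)):
            # L1 unstructured pruning masks the `amount` fraction of weights
            # with the smallest absolute value.
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            # Fold the mask into the weight tensor so the model can be saved
            # and finetuned as an ordinary dense checkpoint.
            prune.remove(module, "weight")
    return model


def report_sparsity(model: nn.Module) -> float:
    """Fraction of exactly-zero weights across prunable layers."""
    zeros, total = 0, 0
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv1d)):
            zeros += int((module.weight == 0).sum())
            total += module.weight.nelement()
    return zeros / total
```

In a typical prune-then-finetune pipeline like the one studied here, the sparse checkpoint produced above would subsequently be finetuned on speech data to recover naturalness and intelligibility; the sketch only covers the pruning step itself.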