Most recent speech synthesis systems are composed of a synthesizer and a vocoder. However, existing synthesizers and vocoders can only be paired when both use acoustic features extracted with the same specific configuration. Hence, arbitrary synthesizers and vocoders cannot be combined into a complete system, let alone paired with a newly developed model. In this paper, we propose Universal Adaptor, which takes a Mel-spectrogram parametrized by a source configuration and converts it into a Mel-spectrogram parametrized by a target configuration, provided that the source and target configurations are supplied. Experiments show that the quality of speech synthesized from the output of Universal Adaptor is comparable to that of speech synthesized from ground-truth Mel-spectrograms, in both single-speaker and multi-speaker scenarios. Moreover, Universal Adaptor can be applied to recent TTS and voice conversion systems without degrading quality.
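To make the configuration mismatch concrete, the following is a minimal sketch of a naive, non-learned baseline. It is not the Universal Adaptor, whose internals the abstract does not describe. The sketch maps a linear-magnitude Mel-spectrogram back to an approximate linear-frequency spectrogram with the pseudo-inverse of the source filterbank, then re-applies the target filterbank. The `MelConfig` record, its field names, and `naive_convert` are hypothetical names introduced only for illustration, and the sketch assumes both configurations share the sample rate and FFT size, so it covers only differences in the number of Mel bins and the frequency range.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class MelConfig:
    """Hypothetical configuration record; the field names are illustrative."""
    sample_rate: int
    n_fft: int
    n_mels: int
    fmin: float
    fmax: float


def hz_to_mel(freq_hz):
    # HTK-style Mel scale.
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(cfg):
    """Triangular Mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, cfg.sample_rate / 2.0, cfg.n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the Mel scale.
    edges = mel_to_hz(
        np.linspace(hz_to_mel(cfg.fmin), hz_to_mel(cfg.fmax), cfg.n_mels + 2)
    )
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def naive_convert(mel_src, src, tgt):
    """Map a (src.n_mels, frames) Mel-spectrogram to (tgt.n_mels, frames).

    Assumption: linear-magnitude input and identical sample rate and FFT size.
    """
    if (src.sample_rate, src.n_fft) != (tgt.sample_rate, tgt.n_fft):
        raise ValueError("this sketch only handles n_mels/fmin/fmax changes")
    # Least-squares estimate of the linear-frequency spectrogram.
    linear = np.linalg.pinv(mel_filterbank(src)) @ mel_src
    linear = np.maximum(linear, 0.0)  # magnitudes cannot be negative
    return mel_filterbank(tgt) @ linear


if __name__ == "__main__":
    src = MelConfig(sample_rate=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=8000.0)
    tgt = MelConfig(sample_rate=22050, n_fft=1024, n_mels=100, fmin=0.0, fmax=11025.0)
    rng = np.random.default_rng(0)
    spectrogram = rng.random((src.n_fft // 2 + 1, 7))  # stand-in for real audio
    mel_src = mel_filterbank(src) @ spectrogram
    print(naive_convert(mel_src, src, tgt).shape)
```

Because the Mel projection discards information, this baseline is lossy, and it cannot handle changes in hop length, window size, or sample rate at all, which is one way to see why a dedicated conversion model is motivated.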