Unit selection synthesis systems required accurate segmentation and labeling of the speech signal owing to their concatenative nature. Hidden Markov model-based speech synthesis accommodates some transcription errors, but it was later shown that accurate transcriptions yield highly intelligible speech with smaller amounts of training data. With the arrival of end-to-end (E2E) systems, it was observed that very good quality speech could be synthesised with large amounts of data. As end-to-end synthesis progressed from Tacotron to FastSpeech2, it has become evident that features representing prosody are important for good-quality synthesis. In particular, the durations of the sub-word units are important. Variants of FastSpeech use a teacher model or forced alignments to obtain good-quality synthesis. In this paper, we focus on duration prediction, using signal processing cues in tandem with forced alignment to produce accurate phone durations during training. The current work aims to highlight the importance of accurate alignments for good-quality synthesis. To this end, we train E2E systems on accurately labeled data and compare them with systems trained on approximately labeled data.