This paper proposes a singing voice synthesis (SVS) method based on frame-level sequence-to-sequence models that accounts for vocal timing deviation. In SVS, the timing of the singing must be synchronized with the temporal structure given by the score, while allowing for differences between the actual vocal timing and the note onset timing. Many SVS systems, including our previous work, account for these deviations by converting phoneme-level score features into frame-level ones on the basis of phoneme boundaries obtained from an external aligner. The sound quality of such systems therefore depends on the accuracy of the aligner. To alleviate this problem, we introduce an attention mechanism that operates on frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in the phoneme boundaries. We also evaluate the system with pseudo-phoneme boundaries defined by heuristic rules based on the musical score, for the case where no aligner is available. The experimental results show the effectiveness of the proposed system.
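The conversion of phoneme-level score features into frame-level ones can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `expand_to_frames`, the 5 ms frame shift, and the representation of boundaries as (start, end) pairs in seconds are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def expand_to_frames(
    phoneme_feats: Sequence[T],
    boundaries: Sequence[Tuple[float, float]],
    frame_shift: float = 0.005,  # assumed frame shift in seconds (hypothetical value)
) -> List[T]:
    """Repeat each phoneme-level feature once per frame inside its boundary.

    `boundaries` holds one (start, end) pair in seconds per phoneme, for
    example as produced by an external aligner or by heuristic rules.
    """
    frames: List[T] = []
    for feat, (start, end) in zip(phoneme_feats, boundaries):
        # Convert the time interval to frame indices.
        start_frame = round(start / frame_shift)
        end_frame = round(end / frame_shift)
        # Guard against inverted boundaries by emitting no frames.
        frames.extend([feat] * max(end_frame - start_frame, 0))
    return frames
```

Because every frame inherits its feature directly from the boundary pair, any error in a boundary shifts the frame-level input, which is the dependence on aligner accuracy that the abstract describes.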