Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper empirically evaluates two open-source text-to-speech models, VITS and Style-BERT-VITS2 JP Extra (SBV2JE), on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate the models on naturalness (mean opinion score, MOS, and comparative mean opinion score, CMOS), intelligibility (word error rate, WER), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves a lower WER, and is slightly preferred in CMOS. With its pitch-accent controls and WavLM-based discriminator, SBV2JE proves effective for applications such as language learning and character dialogue generation, despite higher computational demands.