Learning a new language involves constantly comparing one's own speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Adult learners of a language adjust their speech to match a tutor's reference. This paper proposes a method to synthetically generate correct pronunciation feedback given an incorrect production. Furthermore, our aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. The masked waveform serves as input to a speech generator, implemented as a deep learning inpainting system with a U-net architecture and trained to output reconstructed speech. The training set is composed of unimpaired, properly pronounced speech examples, and the generator is trained to reconstruct the original speech from its masked version. We evaluated the performance of our system on phoneme replacement in minimal-pair words of English as well as on speech from children with pronunciation disorders. Results suggest that human listeners slightly prefer our generated speech over a baseline that replaces the inaccurate phoneme with a smoothed production from a different speaker.
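The masking step described above can be illustrated with a minimal sketch. It assumes the phoneme boundaries (in seconds) are already known, for example from a forced aligner, which the abstract does not specify. The function name `mask_phoneme` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def mask_phoneme(waveform, sample_rate, start_s, end_s):
    """Return a copy of `waveform` with samples in [start_s, end_s) set to zero.

    Hypothetical helper: the phoneme boundaries are assumed to be given,
    e.g. by a forced aligner. The input array is left unmodified.
    """
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    # Convert boundaries from seconds to sample indices.
    start = int(round(start_s * sample_rate))
    end = min(int(round(end_s * sample_rate)), len(waveform))
    masked = np.array(waveform, dtype=np.float32, copy=True)
    masked[start:end] = 0.0  # zero out the inaccurate phoneme
    return masked


# Illustrative training pair on a synthetic signal (not real speech):
# the generator would see `masked` and be trained to output `clean`.
sample_rate = 16000
clean = np.ones(sample_rate, dtype=np.float32)
masked = mask_phoneme(clean, sample_rate, start_s=0.25, end_s=0.5)
```

During training, the same masking would be applied to correctly pronounced speech so that the original waveform serves as the reconstruction target. At inference time, the mask would cover the mispronounced phoneme in the learner's recording.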