Though significant progress has been made in voice conversion (VC) for typical speech, VC for atypical speech, e.g., dysarthric and second-language (L2) speech, remains a challenge, since it involves correcting atypical prosody while maintaining speaker identity. To address this issue, we propose a VC system with explicit prosodic modelling and deep speaker embedding (DSE) learning. First, a speech encoder strives to extract robust phoneme embeddings from atypical speech. Second, a prosody corrector takes phoneme embeddings as input to infer typical phoneme duration and pitch values. Third, a conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE, which is learned via a speaker encoder or speaker adaptation. Extensive experiments demonstrate that speaker adaptation achieves higher speaker similarity, while the speaker-encoder-based conversion model greatly reduces dysarthric and non-native pronunciation patterns and improves speech intelligibility. A comparison of speech recognition results between the original dysarthric speech and the converted speech shows that absolute reductions of 47.6% in character error rate (CER) and 29.3% in word error rate (WER) can be achieved.
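To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of the architecture the abstract describes: a speech encoder producing phoneme embeddings, a prosody corrector predicting typical duration and pitch, and a conversion model conditioned on the target DSE. All module names, layer choices, and dimensions (SpeechEncoder, ProsodyCorrector, ConversionModel, 256-dim phoneme embeddings, 128-dim DSE) are illustrative assumptions, not the paper's actual implementation; the predicted durations would normally drive a length regulator, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Extracts frame-level phoneme embeddings from acoustic features."""
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, emb_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, feats):                        # (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out)                        # (B, T, emb_dim)

class ProsodyCorrector(nn.Module):
    """Predicts typical (log-)duration and pitch from phoneme embeddings."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.duration_head = nn.Linear(emb_dim, 1)   # log-duration per step
        self.pitch_head = nn.Linear(emb_dim, 1)      # pitch target (e.g., log-F0)

    def forward(self, phoneme_emb):                  # (B, T, emb_dim)
        log_dur = self.duration_head(phoneme_emb).squeeze(-1)
        pitch = self.pitch_head(phoneme_emb).squeeze(-1)
        return log_dur, pitch

class ConversionModel(nn.Module):
    """Generates mel-spectrogram frames conditioned on the target DSE."""
    def __init__(self, emb_dim=256, dse_dim=128, mel_dim=80):
        super().__init__()
        self.decoder = nn.LSTM(emb_dim + dse_dim + 1, 512, batch_first=True)
        self.mel_out = nn.Linear(512, mel_dim)

    def forward(self, phoneme_emb, pitch, dse):      # dse: (B, dse_dim)
        dse_tiled = dse.unsqueeze(1).expand(-1, phoneme_emb.size(1), -1)
        x = torch.cat([phoneme_emb, pitch.unsqueeze(-1), dse_tiled], dim=-1)
        out, _ = self.decoder(x)
        return self.mel_out(out)                     # (B, T, mel_dim)

# Toy forward pass with random tensors, for shape checking only.
feats = torch.randn(1, 120, 80)    # acoustic features of atypical speech
dse = torch.randn(1, 128)          # stand-in for a learned target DSE
emb = SpeechEncoder()(feats)
log_dur, pitch = ProsodyCorrector()(emb)   # log_dur unused: length regulation omitted
mel = ConversionModel()(emb, pitch, dse)
print(mel.shape)                   # torch.Size([1, 120, 80])
```

In the setting the abstract describes, the DSE vector would come either from a pretrained speaker encoder or from speaker adaptation on target-speaker data; here it is random noise purely to exercise the forward pass.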