Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, whispering can serve as a semi-silent form of speech interaction in public places without being audible to others. Converting whispers to normal speech can also improve speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques either do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, and it requires only unlabeled speech data from the target speaker. We confirmed that the quality of speech converted from a whisper was improved while its natural prosody was preserved. Additionally, we confirmed the effectiveness of the proposed approach in performing speech reconstruction for people with speech or hearing disabilities. (Project page: http://lab.rekimoto.org/projects/wesper)
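The two-stage structure described in the abstract (an STU encoder producing speech units shared by whispered and normal speech, followed by a UTS decoder that renders those units in a target voice) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the WESPER implementation: the names `ToySTUEncoder`, `ToyUTSDecoder`, and `convert` are hypothetical, the units are modeled as discrete codebook indices, frames are two-dimensional vectors, and a whisper is treated as a mere scaled-down copy of normal speech, which is a deliberate oversimplification.

```python
import math
from dataclasses import dataclass
from typing import List

Frame = List[float]  # one acoustic feature vector (hypothetical representation)


def cosine(a: Frame, b: Frame) -> float:
    """Cosine similarity; ignores overall amplitude, so quiet frames still match."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToySTUEncoder:
    """Stand-in for a speech-to-unit encoder: frames -> unit ids (assumed discrete)."""
    codebook: List[Frame]

    def encode(self, frames: List[Frame]) -> List[int]:
        # Pick, for each frame, the codebook entry with the highest cosine similarity.
        return [
            max(range(len(self.codebook)), key=lambda i: cosine(f, self.codebook[i]))
            for f in frames
        ]


@dataclass
class ToyUTSDecoder:
    """Stand-in for a unit-to-speech decoder: unit ids -> target-speaker frames."""
    target_voice: List[Frame]  # one output frame per unit id

    def decode(self, units: List[int]) -> List[Frame]:
        return [self.target_voice[u] for u in units]


def convert(frames: List[Frame], enc: ToySTUEncoder, dec: ToyUTSDecoder) -> List[Frame]:
    """Whisper-to-normal conversion as encode-then-decode."""
    return dec.decode(enc.encode(frames))


# Toy data: the "whisper" is the "normal" utterance scaled down by a factor of 10.
encoder = ToySTUEncoder(codebook=[[1.0, 0.0], [0.0, 1.0]])
decoder = ToyUTSDecoder(target_voice=[[0.9, 0.1], [0.2, 0.8]])
normal = [[1.0, 0.1], [0.1, 1.0]]
whisper = [[0.1, 0.01], [0.01, 0.1]]

normal_units = encoder.encode(normal)
whisper_units = encoder.encode(whisper)
converted = convert(whisper, encoder, decoder)
```

In this sketch the whispered and normal inputs map to the same unit sequence, so the decoder produces identical target-voice frames for both. Swapping `target_voice` changes the output voice without touching the encoder, which mirrors the abstract's point that only the decoder depends on target-speaker data.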