Novel text-to-speech (TTS) systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that match photos of faces, art portraits, and cartoons well. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) the speaker gender apparent from the face is well recovered in the voice, and (3) participants consistently move toward the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide range of applications, including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.
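The iterative latent-space search described above can be illustrated with a minimal toy sketch. Here a distance-to-target oracle stands in for human judgment, and each round perturbs the current speaker embedding along random directions, keeping the best candidate. The embedding dimension, step size, and helper names are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # assumed speaker-embedding dimensionality

def propose_candidates(current, n=4, step=0.5):
    """Perturb the current embedding along n random unit directions
    to produce candidate voices (hypothetical helper)."""
    dirs = rng.normal(size=(n, DIM))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return current + step * dirs

def human_sampling(start, target, rounds=50, step=0.5):
    """Toy stand-in for the human-in-the-loop search: each round,
    the 'user' keeps whichever option (including the current voice)
    lies closest to a target voice prototype."""
    current = start
    for _ in range(rounds):
        cands = propose_candidates(current, step=step)
        cands = np.vstack([current[None, :], cands])  # current stays an option
        dists = np.linalg.norm(cands - target, axis=1)
        current = cands[np.argmin(dists)]
    return current

start = rng.normal(size=DIM)   # initial voice embedding
target = rng.normal(size=DIM)  # "true" voice for the face
final = human_sampling(start, target)
print(np.linalg.norm(start - target), np.linalg.norm(final - target))
```

Because the current embedding is always retained as a candidate, the distance to the target is non-increasing across rounds; in the paper's setting, a human rater's preference replaces the distance oracle.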