The mapping from text to speech (TTS) is non-deterministic: letters may be pronounced differently depending on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent such characteristics and transfer them from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from an overlapping mixture of several speakers. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the possibility of using these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and propose speaker attractor text-to-speech (SATTS). Through various experiments, we show that SATTS can synthesize natural speech from text in the voice of an unseen target speaker, given a reference signal that may have been recorded under less than ideal conditions, i.e., with reverberation or mixed with other speakers.
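To make the attractor idea concrete, the following is a minimal sketch in the style of deep attractor networks, and is not the SATTS implementation. The function names, the toy 2-D embeddings, and the oracle speaker assignments are all illustrative assumptions; in a trained system the embeddings would be produced by a neural network. Each attractor is taken as the mean embedding of the time-frequency bins assigned to one speaker, and a softmax over embedding-attractor similarities gives each bin a soft assignment to every speaker.

```python
import math


def compute_attractors(embeddings, assignments, num_speakers):
    """Return one attractor per speaker: the mean embedding of its bins.

    Assumption: `assignments` holds an oracle speaker index for every bin,
    which is only available at training time in attractor-based separation.
    """
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_speakers)]
    counts = [0] * num_speakers
    for vec, spk in zip(embeddings, assignments):
        counts[spk] += 1
        for k in range(dim):
            sums[spk][k] += vec[k]
    # max(c, 1) avoids division by zero for a speaker with no bins.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def soft_masks(embeddings, attractors):
    """Softmax over speakers of the dot product between bin and attractor."""
    masks = []
    for vec in embeddings:
        scores = [sum(v * a for v, a in zip(vec, att)) for att in attractors]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        masks.append([e / total for e in exps])
    return masks


# Toy data (hypothetical): four time-frequency bins, two speakers.
embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
assignments = [0, 0, 1, 1]

attractors = compute_attractors(embeddings, assignments, num_speakers=2)
masks = soft_masks(embeddings, attractors)
```

In this toy setting every bin receives its largest mask value for the speaker it was assigned to, which is the "pull towards its own attractor, push away from the others" behaviour described above. Because an attractor summarizes one speaker's bins even when the signal is a mixture, it is a plausible candidate for a speaker representation in TTS, which is the possibility this work explores.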