We present an approach for unsupervised learning of speech representations that disentangle content and style. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance information; and (3) a conditional decoder that reconstructs speech given both local and global latent variables. Our experiments show that (1) the local latent variables encode speech content, since reconstructed speech can be recognized by an ASR system with low word error rates (WER), even when paired with a different global encoding; and (2) the global latent variables encode speaker style, since reconstructed speech shares the speaker identity of the utterance that supplied the global encoding. Additionally, we demonstrate a useful application of our pre-trained model: we can train a speaker recognition model on the global latent variables and achieve high accuracy by fine-tuning with as little data as one label per speaker.
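To make the three-component architecture concrete, the following is a minimal PyTorch sketch of our own, not the authors' implementation; the class name, the choice of GRU layers, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DisentanglingAutoencoder(nn.Module):
    """Hypothetical sketch: local (per-frame) + global (per-utterance)
    encoders feeding a conditional decoder, as described in the abstract."""
    def __init__(self, n_mels=80, local_dim=64, global_dim=128):
        super().__init__()
        # (1) Local encoder: one latent per frame (content).
        self.local_enc = nn.GRU(n_mels, local_dim, batch_first=True)
        # (2) Global encoder: a single latent per utterance (style).
        self.global_enc = nn.GRU(n_mels, global_dim, batch_first=True)
        # (3) Conditional decoder: reconstructs frames from both latents.
        self.decoder = nn.GRU(local_dim + global_dim, n_mels, batch_first=True)

    def forward(self, x, style_source=None):
        # x: (batch, frames, n_mels); style comes from x unless overridden.
        z_local, _ = self.local_enc(x)                     # (B, T, local_dim)
        _, h_global = self.global_enc(
            style_source if style_source is not None else x)
        z_global = h_global[-1]                            # (B, global_dim)
        # Broadcast the utterance-level latent across frames, then decode.
        z_global = z_global.unsqueeze(1).expand(-1, x.size(1), -1)
        recon, _ = self.decoder(torch.cat([z_local, z_global], dim=-1))
        return recon

model = DisentanglingAutoencoder()
x = torch.randn(2, 100, 80)   # content utterances
y = torch.randn(2, 100, 80)   # style (speaker) utterances
converted = model(x, style_source=y)
print(converted.shape)        # torch.Size([2, 100, 80])
```

Calling the model with a different `style_source` mirrors the cross-reconstruction experiments above: content should follow `x` (low WER under ASR) while speaker identity should follow `y`.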