In this work we address the disentanglement of style and content in speech signals. We propose a fully convolutional variational autoencoder with two encoders: a content encoder and a style encoder. To foster disentanglement, we propose adversarial contrastive predictive coding. This new disentanglement method requires neither parallel data nor any supervision. We show that the proposed technique separates speaker and content traits into the two representations and achieves speaker-content disentanglement performance competitive with other unsupervised approaches. We further demonstrate that, when used for phone recognition, the content representation is more robust to a train-test mismatch than spectral features.
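To make the two-encoder architecture concrete, the following is a minimal sketch of a fully convolutional VAE with separate content and style encoders, assuming mel-spectrogram input, PyTorch, and specific layer sizes and pooling choices that are illustrative only and not taken from the paper; the adversarial contrastive predictive coding objective used to enforce disentanglement is not shown.

```python
# Illustrative sketch (not the authors' code): a fully convolutional VAE with
# a frame-wise content encoder and an utterance-level style encoder.
# Layer sizes, mel-spectrogram input, and the time-pooling for style are assumptions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_ch=80, hidden=256, z_dim=64, global_pool=False):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.mu = nn.Conv1d(hidden, z_dim, kernel_size=1)
        self.logvar = nn.Conv1d(hidden, z_dim, kernel_size=1)
        self.global_pool = global_pool  # style encoder pools over time

    def forward(self, x):  # x: (batch, mel_bins, frames)
        h = self.net(x)
        mu, logvar = self.mu(h), self.logvar(h)
        if self.global_pool:  # one style vector per utterance
            mu = mu.mean(-1, keepdim=True)
            logvar = logvar.mean(-1, keepdim=True)
        # reparameterization trick
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

class TwoEncoderVAE(nn.Module):
    def __init__(self, mel_bins=80, c_dim=64, s_dim=64):
        super().__init__()
        self.content_enc = ConvEncoder(mel_bins, z_dim=c_dim)                  # frame-wise content
        self.style_enc = ConvEncoder(mel_bins, z_dim=s_dim, global_pool=True)  # utterance-level style
        self.decoder = nn.Sequential(
            nn.Conv1d(c_dim + s_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, mel_bins, kernel_size=5, padding=2),
        )

    def forward(self, x):
        zc, mu_c, lv_c = self.content_enc(x)
        zs, mu_s, lv_s = self.style_enc(x)
        zs = zs.expand(-1, -1, zc.size(-1))  # broadcast style over time
        x_hat = self.decoder(torch.cat([zc, zs], dim=1))
        return x_hat, (mu_c, lv_c), (mu_s, lv_s)
```

In such a setup, reconstruction and KL terms would be computed from the decoder output and the two posteriors, while an additional adversarial objective would discourage speaker information from leaking into the content representation.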