Prosody plays an important role in characterizing the style of a speaker or an emotion, yet most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody style transfer algorithms need to rely on some form of text transcription to identify the content information, which confines their application to high-resource languages only. Recently, SpeechSplit has made sizeable progress towards unsupervised prosody style transfer, but it is unable to extract high-level global prosody style in an unsupervised manner. In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by self-expressive representation learning. Experiments on different style transfer tasks show that AutoPST can effectively convert prosody that correctly reflects the styles of the target domains.