Expressive neural text-to-speech (TTS) systems incorporate a style encoder that learns a latent embedding carrying the style information. However, this embedding process may also encode redundant textual information, a phenomenon known as content leakage. Researchers have attempted to resolve this problem by adding an ASR loss or other auxiliary supervision losses. In this study, we propose an unsupervised method called the "information sieve" to reduce the effect of content leakage in prosody transfer. The rationale of this approach is that a well-designed downsample-upsample filter can force the style encoder to focus on style information rather than on the textual content of the reference speech: the extracted style embeddings are downsampled at a fixed interval and then upsampled by duplication. Furthermore, we apply instance normalization in the convolution layers to help the system learn a better latent style space. Objective metrics, such as a significantly lower word error rate (WER), demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the model retains its prosody transferability compared with baseline models such as the original GST-Tacotron and ASR-guided Tacotron.
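To make the downsample-upsample filter concrete, the following is a minimal PyTorch sketch of the idea as described in the abstract. The class names, the `stride` parameter, and the tensor shapes are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the "information sieve": keep every `stride`-th style
# frame, then duplicate the kept frames back to the original length.
# All names and shapes here are assumptions for illustration only.
import torch
import torch.nn as nn


class InformationSieve(nn.Module):
    """Downsample a style-embedding sequence at a fixed interval,
    then upsample by duplication (a downsample-upsample filter)."""

    def __init__(self, stride: int = 4):
        super().__init__()
        self.stride = stride

    def forward(self, style: torch.Tensor) -> torch.Tensor:
        # style: (batch, time, dim) sequence from the style encoder
        kept = style[:, :: self.stride, :]                    # downsample
        up = kept.repeat_interleave(self.stride, dim=1)       # duplicate frames
        return up[:, : style.size(1), :]                      # trim to original length


class StyleConvBlock(nn.Module):
    """Convolution layer with instance normalization, as mentioned
    for learning a better latent style space (hypothetical block)."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm1d(channels)  # normalizes per-utterance statistics
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(self.norm(self.conv(x)))
```

The intuition captured by the sketch: fine-grained, frame-level variation (where phonetic content lives) is discarded by the downsampling, while slowly varying prosodic information survives the duplication step, so the style encoder is pushed toward style rather than text.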