Speech synthesis is used in a wide variety of industries, yet synthesized speech still often sounds flat or robotic. State-of-the-art methods that allow for prosody control are cumbersome to use and do not support easy tuning. To address some of these drawbacks, in this work we implement a text-to-speech model whose inferred speech can be tuned to convey the desired emotions. To do so, we combine Generative Adversarial Networks (GANs) with a sequence-to-sequence model that uses an attention mechanism. We evaluate and study four configurations with different inputs and training strategies, and show that our best model generates speech files that lie in the same distribution as the original training dataset. Additionally, we propose a new strategy that boosts training convergence by applying a guided attention loss.
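For context on the guided attention loss mentioned above, the commonly used formulation (following Tachibana et al., 2017) penalizes attention weights that stray from the diagonal alignment between encoder and decoder time steps. The sketch below is a minimal PyTorch illustration of that general idea, not the exact loss used in this work; the bandwidth parameter `g=0.2` and the function name are assumptions for illustration only.

```python
import torch

def guided_attention_loss(attn, g=0.2):
    """Soft-diagonal penalty on a seq2seq attention matrix.

    attn: (B, T_dec, T_enc) attention weights from the decoder.
    g:    width of the diagonal band (hypothetical default).
    """
    B, T_dec, T_enc = attn.shape
    # Normalized decoder and encoder positions in [0, 1)
    n = torch.arange(T_dec, device=attn.device, dtype=attn.dtype) / T_dec
    t = torch.arange(T_enc, device=attn.device, dtype=attn.dtype) / T_enc
    # Penalty matrix: ~0 near the diagonal, ~1 far from it
    W = 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2 * g ** 2))
    # Mean penalty over batch and all attention entries,
    # added to the main training objective with some weight
    return (attn * W[None]).mean()
```

In practice a loss of this form is added (with a small weight) to the main reconstruction/adversarial objective so that the attention alignment becomes roughly monotonic early in training, which is what speeds up convergence.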