Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

End-to-end text-to-speech (TTS) synthesis directly converts input text to output acoustic features using a single network. Recent advances in end-to-end TTS are due to a key technique, the attention mechanism, and all successful methods proposed so far have been based on soft attention. However, although network structures are becoming increasingly complex, end-to-end TTS systems with soft attention may still fail to learn and predict an accurate alignment between input and output. This may be because soft attention is too flexible. Therefore, we propose an approach with more explicit but natural constraints suited to speech signals, which makes alignment learning and prediction in end-to-end TTS systems more robust. The proposed system, with a constrained alignment scheme borrowed from segment-to-segment neural transduction (SSNT), directly calculates the joint probability of acoustic features and alignment given an input text. The alignment is designed to be hard and monotonically increasing, reflecting the nature of speech; it is treated as a latent variable and marginalized during training. During prediction, both the alignment and the acoustic features can be generated from the probability distributions. The advantages of our approach are that many modules needed for soft attention can be simplified and that the end-to-end TTS model can be trained with a single likelihood function. As far as we know, our approach is the first end-to-end TTS system without a soft attention mechanism.
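The abstract gives no equations, so the following is only a minimal sketch of what marginalizing a monotonic hard alignment could look like, not the authors' implementation. It assumes an SSNT-style transition in which, between consecutive output frames, the alignment either stays on the current input token or advances by exactly one token, and it sums over all such paths with a forward recursion in the log domain. The function name, the argument names `log_obs` and `stay_prob`, and their shapes are hypothetical.

```python
import numpy as np


def marginal_log_likelihood(log_obs, stay_prob):
    """Log-likelihood summed over all monotonic hard alignments (sketch).

    Assumed inputs (hypothetical, not taken from the paper):
      log_obs[t, j]   log-density of output frame t when aligned to token j,
                      shape (T, U).
      stay_prob[t, j] probability of staying on token j between frame t and
                      frame t + 1, shape (T - 1, U); the complement is the
                      probability of advancing to token j + 1.
    The alignment is assumed to start on the first token and end on the last.
    """
    T, U = log_obs.shape
    log_alpha = np.full((T, U), -np.inf)
    log_alpha[0, 0] = log_obs[0, 0]

    # log(0) is a legitimate value here (a forbidden transition).
    with np.errstate(divide="ignore"):
        log_stay = np.log(stay_prob)
        log_move = np.log1p(-stay_prob)

    for t in range(1, T):
        # Paths that were already on token j and stayed there.
        stay = log_alpha[t - 1] + log_stay[t - 1]
        # Paths that were on token j - 1 and advanced by one token.
        move = np.full(U, -np.inf)
        move[1:] = log_alpha[t - 1, :-1] + log_move[t - 1, :-1]
        log_alpha[t] = np.logaddexp(stay, move) + log_obs[t]

    # Only paths that finish on the last token count as complete alignments.
    return log_alpha[T - 1, U - 1]


# Three frames, two tokens, unit observation likelihoods, stay probability 0.5.
example = marginal_log_likelihood(np.zeros((3, 2)), np.full((2, 2), 0.5))
```

In this toy case there are two valid alignments (token indices 0, 0, 1 and 0, 1, 1), each with probability 0.25, so `example` equals log 0.5. In a trained model the two inputs would be produced by the network, and the negative of the returned value could serve as the single training loss the abstract refers to.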

Related content

Speech synthesis

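Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computer technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain application scenarios. Today, speech synthesis is widely used in information announcement systems at banks and hospitals, in car navigation systems, and in automated-answering call centers, with large economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be affecting every aspect of people's lives.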
