WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

Tacotron-based text-to-speech (TTS) systems synthesize speech directly from text input. Such frameworks typically consist of a feature prediction network that maps character sequences to frequency-domain acoustic features, followed by a waveform reconstruction algorithm or a neural vocoder that generates the time-domain waveform from those features. Because the loss function is usually calculated only on the frequency-domain acoustic features, it does not directly control the quality of the generated time-domain waveform. To address this problem, we propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has two loss functions: 1) a time-domain loss, denoted the waveform loss, that measures the distortion between the natural and generated waveforms; and 2) a frequency-domain loss that measures the Mel-scale acoustic feature loss between the natural and generated acoustic features. WaveTTS thus addresses the quality of both the acoustic features and the resulting speech waveform. To the best of our knowledge, this is the first implementation of Tacotron with a joint time-frequency domain loss. Experimental results show that the proposed framework outperforms the baselines and achieves high-quality synthesized speech.
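To make the two-term objective concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not state which distance measures or weighting are used, so the mean squared error on Mel features, the mean absolute error on waveform samples, the `time_weight` parameter, and all function names are assumptions chosen only for illustration.

```python
import numpy as np


def mel_feature_loss(mel_natural, mel_generated):
    """Frequency-domain term: mean squared error between two Mel-scale
    feature matrices of identical shape (assumed distance measure)."""
    mel_natural = np.asarray(mel_natural, dtype=np.float64)
    mel_generated = np.asarray(mel_generated, dtype=np.float64)
    if mel_natural.shape != mel_generated.shape:
        raise ValueError("Mel feature matrices must have the same shape")
    return float(np.mean((mel_natural - mel_generated) ** 2))


def waveform_loss(wav_natural, wav_generated):
    """Time-domain term: mean absolute error over the overlapping samples
    of two 1-D waveforms (assumed distance measure)."""
    wav_natural = np.asarray(wav_natural, dtype=np.float64)
    wav_generated = np.asarray(wav_generated, dtype=np.float64)
    n = min(len(wav_natural), len(wav_generated))
    if n == 0:
        raise ValueError("Waveforms must contain at least one sample")
    return float(np.mean(np.abs(wav_natural[:n] - wav_generated[:n])))


def joint_loss(mel_natural, mel_generated, wav_natural, wav_generated,
               time_weight=1.0):
    """Joint time-frequency loss: frequency-domain term plus a weighted
    time-domain term. `time_weight` is a hypothetical hyperparameter."""
    freq_term = mel_feature_loss(mel_natural, mel_generated)
    time_term = waveform_loss(wav_natural, wav_generated)
    return freq_term + time_weight * time_term
```

In an actual training setup, the generated waveform would have to be produced from the predicted features by a differentiable reconstruction step so that the time-domain term can influence the feature prediction network; the sketch above only shows how the two terms combine.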

Related content

Speech synthesis

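Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on several disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated-response call centers, and has generated substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, spoken-language teaching, and rehabilitation therapy. Speech synthesis can fairly be said to be influencing every aspect of people's lives.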

[ACL 2020, Amazon] Multiresolution and Multimodal Speech Recognition with Transformers

Zhuanzhi Member Service

15+ reads · May 5, 2020

Causal Graphs, 52-page PPT

Zhuanzhi Member Service

250+ reads · April 19, 2020

[Alibaba, CVPR 2020] Learning in the Frequency Domain

Zhuanzhi Member Service

29+ reads · March 14, 2020

[Google Research] Wavesplit: End-to-End Speech Separation by Speaker Clustering

Zhuanzhi Member Service

19+ reads · February 26, 2020