Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling

Typical high-quality text-to-speech (TTS) systems today use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio. High-quality spectrum models usually incorporate an encoder-decoder architecture with self-attention or bi-directional long short-term memory (BLSTM) units. While these models can produce high-quality speech, they often incur an $O(L)$ increase in both latency and real-time factor (RTF) with respect to input length $L$. In other words, longer inputs lead to longer delay and slower synthesis speed, limiting their use in real-time applications. In this paper, we propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding. The proposed architecture achieves high audio quality (MOS of 4.31 compared to 4.48 for ground truth), low latency, and low RTF at the same time. Moreover, both latency and RTF of the proposed system stay constant regardless of input length, making it ideal for real-time applications.
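To make the scaling claim concrete, here is a minimal sketch of RTF (synthesis time divided by the duration of the audio produced) together with a toy cost model. The toy model is an illustration only and is not the architecture from the paper: it assumes each decoder frame pays a cost proportional to the number of encoder states it attends to, and the names `toy_rtf`, `frames_per_token`, `frame_seconds`, and `cost_per_state` as well as their default values are hypothetical.

```python
from typing import Optional


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def toy_rtf(
    input_length: int,
    attended_states: Optional[int] = None,
    frames_per_token: int = 5,      # hypothetical: output frames per input token
    frame_seconds: float = 0.0125,  # hypothetical: audio duration of one frame
    cost_per_state: float = 1e-6,   # hypothetical: seconds per attended state
) -> float:
    """Toy RTF under the assumption that per-frame cost scales with the
    number of encoder states attended to.

    attended_states=None models attention over all `input_length` states;
    a fixed integer models attention over a compact, fixed-size representation.
    """
    n_states = input_length if attended_states is None else attended_states
    n_frames = input_length * frames_per_token
    synthesis_seconds = n_frames * n_states * cost_per_state
    audio_seconds = n_frames * frame_seconds
    return real_time_factor(synthesis_seconds, audio_seconds)


if __name__ == "__main__":
    for length in (100, 200, 400):
        full = toy_rtf(length)
        compact = toy_rtf(length, attended_states=16)
        print(f"L={length}: full-attention RTF={full:.5f}, compact RTF={compact:.5f}")
```

Under these assumptions the full-attention RTF grows linearly with the input length, while the RTF with a fixed-size representation does not depend on it, which is the behavior the abstract describes.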


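Related content

Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on multiple disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, automated call centers, and similar settings, and has generated substantial economic benefits. In addition, with the proliferation of devices closely tied to daily life, such as smartphones, MP3 players, and PDAs, applications of speech synthesis are extending into entertainment, language teaching, rehabilitation therapy, and other fields. It is fair to say that speech synthesis is affecting every aspect of people's lives.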
