A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Recent Text-to-Speech (TTS) systems trained on read or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe a mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system, whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
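
The abstract does not spell out how "discrete codes within multiple code groups" are produced. One plausible reading is a product-style vector quantizer: each frame's feature vector is split into groups, and each group is matched to the nearest entry of its own codebook. The sketch below illustrates only that lookup step under that assumption. The function name `quantize_groups`, the array shapes, and the fixed toy codebooks are hypothetical, and this is not the MQTTS implementation (which learns its codebooks during training).

```python
import numpy as np


def quantize_groups(frames, codebooks):
    """Nearest-neighbour quantization with one codebook per feature group.

    frames:    array of shape (T, D), one feature vector per frame.
    codebooks: list of G arrays, each of shape (K, D // G).
    Returns (codes, reconstruction) with shapes (T, G) and (T, D).
    """
    # Split the feature dimension into G equal groups (D must be divisible by G).
    parts = np.split(frames, len(codebooks), axis=1)
    codes, recon = [], []
    for part, book in zip(parts, codebooks):
        # Squared Euclidean distance from every frame to every code: (T, K).
        dists = ((part[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)   # index of the nearest code per frame
        codes.append(idx)
        recon.append(book[idx])      # look up the chosen code vectors
    return np.stack(codes, axis=1), np.concatenate(recon, axis=1)


# Toy example: 2 frames, 4-dim features, 2 groups with 2 codes each.
toy_codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [2.0, 2.0]]),
]
toy_frames = np.array([[0.9, 1.1, 0.1, -0.1],
                       [0.1, 0.0, 1.9, 2.2]])
toy_codes, toy_recon = quantize_groups(toy_frames, toy_codebooks)
```

With G groups of K codes each, every frame is described by G small integers rather than a continuous spectrogram vector, which is the kind of discrete target an autoregressive model can predict.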

Related content

Speech Synthesis

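Speech synthesis, also known as text-to-speech (TTS) conversion, turns arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform-concatenation synthesis and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated call centers, where it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, its applications are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can fairly be said to be affecting every aspect of people's lives.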
