Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Speech synthesis and music audio generation from symbolic input differ in many aspects but share some similarities. In this study, we investigate how text-to-speech (TTS) synthesis techniques can be used for piano MIDI-to-audio synthesis tasks. Our investigation includes Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in ways similar to TTS frameworks. We also include reference systems using conventional sound modeling techniques such as sample-based and physical-modeling-based methods. The subjective experimental results demonstrate that the investigated TTS components can be applied to piano MIDI-to-audio synthesis with minor modifications. The results also reveal the performance bottleneck: while the waveform model can synthesize high-quality piano sound given natural acoustic features, the conversion from MIDI to acoustic features is challenging. The full MIDI-to-audio synthesis system is still inferior to the sample-based or physical-modeling-based approaches, but we encourage TTS researchers to test their TTS models on this new task and improve the performance.

Related content

Speech synthesis

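Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can now largely meet the needs of certain application scenarios. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be affecting every aspect of people's lives.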