Speech synthesis and music audio generation from symbolic input differ in many respects but share some similarities. In this study, we investigate how text-to-speech (TTS) synthesis techniques can be used for piano MIDI-to-audio synthesis. Our investigation uses Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in a manner similar to TTS frameworks. We also include reference systems using conventional sound modeling techniques such as sample-based and physical-modeling-based methods. The subjective experimental results demonstrate that the investigated TTS components can be applied to piano MIDI-to-audio synthesis with minor modifications. The results also reveal the performance bottleneck: while the waveform model can synthesize high-quality piano sound given natural acoustic features, the conversion from MIDI to acoustic features is challenging. The full MIDI-to-audio synthesis system is still inferior to the sample-based or physical-modeling-based approaches, but we encourage TTS researchers to test their TTS models on this new task and improve the performance.