Given the similarity between music synthesis from symbolic input and text-to-speech (TTS) synthesis, and the rapid progress of TTS techniques, it is worthwhile to explore how MIDI-to-audio performance can be improved by borrowing from TTS. In this study, we analyze the shortcomings of a TTS-based MIDI-to-audio system and improve it in terms of feature computation, model selection, and training strategy, aiming to synthesize highly natural-sounding audio. Moreover, we conduct an extensive evaluation of the models through listening tests, pitch measurement, and spectrogram analysis. This work not only demonstrates the synthesis of highly natural music but also offers a thorough analytical framework and useful outcomes for the community. Our code, pre-trained models, supplementary materials, and audio samples are openly available at https://github.com/nii-yamagishilab/midi-to-audio.
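To make the front-end stage concrete, the sketch below illustrates one common way to compute an acoustic-model input feature from MIDI: a velocity-normalized piano roll via the pretty_midi library. The function name, the frame rate, and the piano-roll representation itself are illustrative assumptions; the actual feature pipeline used in the repository above may differ.

```python
# A minimal sketch of MIDI feature computation for a MIDI-to-audio front end,
# assuming a piano-roll input representation (hypothetical; not necessarily
# the pipeline used in nii-yamagishilab/midi-to-audio).
import numpy as np
import pretty_midi

def midi_to_pianoroll(midi_path: str, fs: int = 100) -> np.ndarray:
    """Load a MIDI file and return a (128, T) piano-roll feature matrix.

    fs is the frame rate in frames per second. Overlapping-note velocity
    sums are clipped to the MIDI maximum, then normalized to [0, 1] so the
    feature behaves like a frame-level amplitude map.
    """
    midi = pretty_midi.PrettyMIDI(midi_path)
    roll = midi.get_piano_roll(fs=fs)            # shape: (128 pitches, T frames)
    roll = np.clip(roll, 0.0, 127.0) / 127.0     # normalize MIDI velocities
    return roll.astype(np.float32)

# Usage (hypothetical file name):
# features = midi_to_pianoroll("example.mid", fs=100)
```

Such a frame-rate representation mirrors the phoneme-and-duration inputs of a TTS acoustic model, which is what makes borrowing TTS architectures for MIDI-to-audio synthesis plausible in the first place.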