Given the similarity between music and speech synthesis from symbolic input, and the rapid development of text-to-speech (TTS) techniques, it is worthwhile to explore ways to improve MIDI-to-audio synthesis by borrowing from TTS techniques. In this study, we analyze the shortcomings of a TTS-based MIDI-to-audio system and improve it in terms of feature computation, model selection, and training strategy, aiming to synthesize highly natural-sounding audio. Moreover, we conduct an extensive model evaluation through listening tests, pitch measurement, and spectrogram analysis. This work not only demonstrates the synthesis of highly natural music but also offers a thorough analytical approach and useful outcomes for the community. Our code and pre-trained models are open-sourced at https://github.com/nii-yamagishilab/midi-to-audio.