An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, and raw waveform models that can train on any music but offer minimal control and slow generation. In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime. This enables training on a wide range of transcription datasets with a single model, which in turn offers note-level control of composition and instrumentation across a wide range of instruments. We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter. We compare training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and find that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics. Given the interactivity and generality of this approach, we find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
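To make the two-stage pipeline concrete, the sketch below traces the inference data flow: MIDI events are tokenized, an encoder-decoder Transformer maps tokens to a spectrogram, and a GAN inverter renders the spectrogram to audio. All function names, shapes, and hyperparameters here are illustrative placeholders (not the authors' released code); the placeholder bodies just return random arrays so the sketch runs end to end.

```python
# Minimal sketch of the two-stage synthesis pipeline, assuming hypothetical
# module names and default shapes. A real system would use trained networks
# in place of these random-output placeholders.
import numpy as np

def encode_midi(midi_events, vocab_size=512, max_len=256):
    """Placeholder: tokenize note/program events into an integer sequence."""
    rng = np.random.default_rng(0)
    return rng.integers(0, vocab_size, size=max_len)

def transformer_midi_to_spectrogram(tokens, n_frames=400, n_mels=128):
    """Stage 1 placeholder: encoder-decoder Transformer emits spectrogram frames.
    In the DDPM variant, the decoder iteratively denoises the full spectrogram
    rather than predicting frames autoregressively."""
    rng = np.random.default_rng(1)
    return rng.standard_normal((n_frames, n_mels)).astype(np.float32)

def gan_spectrogram_inverter(mel, hop_length=320, sample_rate=16000):
    """Stage 2 placeholder: GAN vocoder inverts the spectrogram to a waveform."""
    n_samples = mel.shape[0] * hop_length
    rng = np.random.default_rng(2)
    return rng.standard_normal(n_samples).astype(np.float32), sample_rate

if __name__ == "__main__":
    tokens = encode_midi(midi_events=[])           # arbitrary instruments/notes as MIDI
    mel = transformer_midi_to_spectrogram(tokens)  # MIDI -> spectrogram
    audio, sr = gan_spectrogram_inverter(mel)      # spectrogram -> audio
    print(f"synthesized {audio.shape[0] / sr:.2f}s of audio from {len(tokens)} tokens")
```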