A differentiable digital signal processing (DDSP) autoencoder is a musical sound synthesizer that combines a deep neural network (DNN) with spectral modeling synthesis. It allows sounds to be edited flexibly by changing the fundamental frequency, timbre features, and loudness (synthesis parameters) extracted from an input sound. However, it is designed for monophonic harmonic sounds and cannot handle mixtures of harmonic sounds. In this paper, we propose a DDSP mixture model that represents a mixture as the sum of the outputs of multiple pretrained DDSP autoencoders. By fitting the output of the proposed model to the observed mixture, we can directly estimate the synthesis parameters of each source. Synthesis parameter extraction experiments show that the proposed method achieves higher and more stable performance than a straightforward baseline that applies a DDSP autoencoder to signals separated by an audio source separation method.
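The core idea, fitting a sum of per-source synthesizers to an observed mixture, can be illustrated with a minimal numpy sketch. This is not the DDSP model itself: the harmonic synthesizer, the choice of fundamental frequencies, and the use of linear least squares (rather than gradient descent on neural synthesis parameters) are all simplifications for illustration.

```python
import numpy as np

SR = 16000  # sample rate (Hz), illustrative

def harmonic_synth(f0, amps, sr=SR, dur=0.1):
    """Render a harmonic tone: a sum of sinusoids at integer multiples of f0."""
    t = np.arange(int(sr * dur)) / sr
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amps))

# Two "sources" with known f0s but unknown harmonic amplitudes
# (f0s chosen so the first three harmonics do not overlap).
f0s = [220.0, 311.0]
true_amps = [np.array([1.0, 0.5, 0.25]), np.array([0.8, 0.4, 0.2])]
mixture = sum(harmonic_synth(f0, a) for f0, a in zip(f0s, true_amps))

# Fit the summed model to the mixture: each column of the design
# matrix is one harmonic of one source, so solving the least-squares
# problem directly recovers each source's synthesis parameters.
t = np.arange(len(mixture)) / SR
cols = [np.sin(2 * np.pi * f0 * (k + 1) * t)
        for f0 in f0s for k in range(3)]
A = np.stack(cols, axis=1)
est, *_ = np.linalg.lstsq(A, mixture, rcond=None)
est_amps = [est[:3], est[3:]]
```

Because the mixture lies exactly in the span of the design matrix, the per-source amplitudes are recovered without running any source separation first, which mirrors the paper's point that fitting the summed model estimates each source's parameters directly.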