A vocoder is a conditional audio generation model that converts acoustic features such as mel-spectrograms into waveforms. Taking inspiration from Differentiable Digital Signal Processing (DDSP), we propose a new vocoder named SawSing for singing voices. SawSing synthesizes the harmonic part of singing voices by filtering a sawtooth source signal with a linear time-variant finite impulse response filter whose coefficients are estimated from the input mel-spectrogram by a neural network. As this approach enforces phase continuity, SawSing can generate singing voices without the phase-discontinuity glitch of many existing vocoders. Moreover, the source-filter assumption provides an inductive bias that allows SawSing to be trained on a small amount of data. Our experiments show that SawSing converges much faster and outperforms state-of-the-art generative adversarial network and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.
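To make the synthesis pipeline described above concrete, the following is a minimal NumPy sketch of its two stages: a phase-continuous sawtooth excitation derived from an F0 contour, and a linear time-variant FIR filter applied by frame-wise convolution with overlap-add. The function names, the 160-sample hop, the 64-tap filter length, and the random placeholder coefficients (standing in for those the neural network would predict from the mel-spectrogram) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sawtooth_from_f0(f0_per_sample, sr):
    # Integrate the instantaneous frequency into a running phase (in cycles),
    # then wrap it into a naive, alias-prone sawtooth in [-1, 1]. Deriving the
    # waveform from one continuous phase is what enforces the phase continuity
    # the abstract credits with avoiding glitches.
    phase = np.cumsum(f0_per_sample / sr)
    return 2.0 * (phase % 1.0) - 1.0

def ltv_fir_filter(source, fir_per_frame, hop):
    # Linear time-variant FIR filtering: convolve each windowed frame of the
    # source with that frame's own coefficients, then overlap-add the results.
    n_frames, taps = fir_per_frame.shape
    frame_len = 2 * hop
    window = np.hanning(frame_len)
    out = np.zeros(len(source) + frame_len + taps)
    for i in range(n_frames):
        start = i * hop
        seg = source[start:start + frame_len]
        if len(seg) < frame_len:
            seg = np.pad(seg, (0, frame_len - len(seg)))
        filtered = np.convolve(seg * window, fir_per_frame[i])
        out[start:start + len(filtered)] += filtered
    return out[:len(source)]

# Toy usage: 1 s at 16 kHz, 160-sample hop, a flat 220 Hz contour, and random
# 64-tap filters standing in for the network's frame-wise predictions.
sr, hop = 16000, 160
f0 = np.full(sr, 220.0)
fir = 0.01 * np.random.randn(sr // hop, 64)
harmonic_audio = ltv_fir_filter(sawtooth_from_f0(f0, sr), fir, hop)
```

In the full system these filter coefficients are the network's output given the mel-spectrogram, and the filtered sawtooth covers only the harmonic part of the voice; this sketch simply illustrates the source-filter mechanism itself.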