Recent neural vocoders usually use a WaveNet-like network to capture the long-term dependencies of the waveform, but such networks require a large number of parameters to achieve good modeling capability. In this paper, an efficient network, named location-variable convolution, is proposed to model the dependencies of waveforms. Whereas WaveNet applies a single, shared set of convolution kernels to capture the dependencies of arbitrary waveforms, location-variable convolution uses a kernel predictor to generate multiple sets of convolution kernels from the mel-spectrum, and each set of kernels is used to perform convolution on its associated waveform interval. Combining WaveGlow with location-variable convolutions, an efficient vocoder, named MelGlow, is designed. Experiments on the LJSpeech dataset show that MelGlow achieves better performance than WaveGlow at small model sizes, which demonstrates the effectiveness of location-variable convolutions and suggests room for further optimization.
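The core idea, one kernel per waveform interval predicted from the conditioning frame for that interval, can be illustrated with a minimal single-channel sketch. Everything named here is an assumption made for illustration: the abstract does not specify the kernel predictor's architecture, so `kernel_predictor` is a hypothetical linear map, and `location_variable_conv`, `weights`, and `hop` are illustrative names rather than the authors' implementation.

```python
def kernel_predictor(mel_frame, weights):
    """Hypothetical linear predictor: map one mel frame to one kernel.

    weights has one row per kernel tap and one column per mel bin.
    """
    return [sum(w * m for w, m in zip(row, mel_frame)) for row in weights]


def location_variable_conv(waveform, mel_frames, weights, hop):
    """Convolve each hop-sized waveform interval with its own predicted kernel.

    Assumes an odd kernel size, zero padding at both ends, and one mel
    frame per interval of `hop` samples.
    """
    kernel_size = len(weights)
    pad = kernel_size // 2
    padded = [0.0] * pad + list(waveform) + [0.0] * pad
    out = []
    for i, frame in enumerate(mel_frames):
        # The kernel depends on the interval's location, unlike a shared kernel.
        kernel = kernel_predictor(frame, weights)
        for t in range(i * hop, min((i + 1) * hop, len(waveform))):
            out.append(sum(kernel[j] * padded[t + j] for j in range(kernel_size)))
    return out


# Toy usage: two intervals of two samples, two mel bins, three kernel taps.
toy_weights = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
toy_mel = [[1.0, 0.0], [0.0, 1.0]]
result = location_variable_conv([1.0, 2.0, 3.0, 4.0], toy_mel, toy_weights, hop=2)
```

In this toy setup the first mel frame yields the kernel `[0, 1, 0]` (identity) and the second yields `[1, 0, 0]` (a one-sample delay), so the two intervals are filtered differently even though the predictor's parameters are shared across them.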