Neural style transfer, which allows applying the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it outperforms selected baselines.
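For readers unfamiliar with the VQ-VAE, the sketch below illustrates the vector-quantization bottleneck it is built on (van den Oord et al., 2017): continuous encoder outputs are snapped to their nearest codebook entries, with a straight-through gradient estimator. This is a minimal illustrative PyTorch implementation, not the method proposed here; the class name `VectorQuantizer` and the hyperparameters `num_codes`, `code_dim`, and `beta` are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck in the spirit of van den Oord et al. (2017)."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight (illustrative default)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.shape[-1])              # (batch*time, code_dim)
        # squared L2 distance from each frame to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1).view(z_e.shape[:-1])      # nearest code per frame
        z_q = self.codebook(idx)                           # quantized latents
        # codebook loss pulls codes toward encoder outputs;
        # commitment loss keeps encoder outputs close to their codes
        loss = (F.mse_loss(z_q, z_e.detach())
                + self.beta * F.mse_loss(z_e, z_q.detach()))
        # straight-through estimator: gradients flow from z_q back to z_e unchanged
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss

# Example: quantize 8 clips of 100 frames with 64-dimensional latents.
vq = VectorQuantizer()
z_q, idx, vq_loss = vq(torch.randn(8, 100, 64))
```

Intuitively, the discrete bottleneck can only retain the information that varies frame to frame (such as pitch content), which is one plausible reason a VQ-VAE lends itself to separating pitch from timbre.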