This paper proposes a novel approach to audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to WaveNet. The model is fully probabilistic, autoregressive, and causal, i.e., each generated sample depends only on previously observed samples. Our approach outperforms a widely used WaveNet architecture by up to 9% on next-step prediction on a comparable dataset. Through the attention mechanism, the architecture learns which past audio samples are important for predicting the next sample. We show how causal Transformer generative models can be used for raw waveform synthesis, and that performance can be improved by a further 2% by conditioning on a wider context of samples. The flexibility of the model to synthesize audio from latent representations suggests a large number of potential applications. However, this novel approach of using generative Transformer architectures for raw audio synthesis is still far from generating meaningful music without latent codes or metadata to aid the generation process.
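For concreteness, the sketch below illustrates the kind of model described above: a causal, decoder-only Transformer that autoregressively predicts the next sample of quantized raw audio, with a triangular attention mask enforcing that each position attends only to earlier samples. This is a minimal illustration under stated assumptions, not the paper's implementation; the class name CausalAudioTransformer, the 256-way quantization (as in WaveNet), and all hyperparameters are illustrative choices.

```python
# Minimal sketch of a causal Transformer for next-sample prediction on
# raw audio quantized to 256 classes (a WaveNet-style setup). All names
# and hyperparameters here are assumptions, not the authors' code.
import torch
import torch.nn as nn

class CausalAudioTransformer(nn.Module):
    def __init__(self, n_classes=256, d_model=128, n_heads=4,
                 n_layers=4, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(n_classes, d_model)   # sample-value embedding
        self.pos = nn.Embedding(max_len, d_model)       # learned positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)       # logits per next sample

    def forward(self, x):
        # x: (batch, time) integer sample indices
        t = x.size(1)
        h = self.embed(x) + self.pos(torch.arange(t, device=x.device))
        # Causal mask: position i may attend only to positions <= i
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        h = self.encoder(h, mask=mask)
        return self.head(h)

# Training objective: categorical cross-entropy on the next sample,
# shown here on a toy batch of random "audio".
model = CausalAudioTransformer()
x = torch.randint(0, 256, (2, 512))
logits = model(x[:, :-1])                  # predict each following sample
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 256), x[:, 1:].reshape(-1))
```

Widening the conditioning context, as the abstract describes, would correspond here to increasing max_len and the training sequence length so that attention can reach further into the past.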