Modeling long-term dependencies in audio signals is a particularly challenging problem, because even short time scales span on the order of a hundred thousand samples. With the recent advent of Transformers, neural architectures have become good at modeling dependencies over longer time scales, but the quadratic cost of self-attention constrains how far they scale. We propose a generative auto-regressive architecture that models audio waveforms over a large context, greater than 500,000 samples. Our model learns time dependencies by first extracting a latent representation with a CNN front-end and then modeling dependencies over these representations with Transformer encoders, trained fully end-to-end: this allows the network to learn whatever representations it deems fit for predicting the next sample. Unlike previous works that compare different time scales to show improvement, we use a standard dataset, with the same number of parameters and the same context, to demonstrate improvements. We achieve state-of-the-art performance compared to other approaches such as WaveNet, SaShiMi, and SampleRNN on a standard dataset for modeling long-term structure. This work points to an exciting direction for the field, given that the improvements in context modeling can be scaled with more data, and can potentially yield better results with billions or trillions of parameters.
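To make the described pipeline concrete, the sketch below shows one plausible instantiation of the architecture: a strided CNN front-end compresses the raw waveform into latent frames, a causally masked Transformer encoder models dependencies over those frames, and a linear head predicts a categorical distribution over the quantized next sample. All layer sizes, strides, and class counts here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (assumed hyperparameters) of a CNN front-end + Transformer encoder
# auto-regressive audio model, trained end-to-end on next-sample prediction.
import torch
import torch.nn as nn

class CNNTransformerAR(nn.Module):
    def __init__(self, n_classes=256, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        # Strided 1-D convolutions turn every 64 raw samples into one latent frame.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, d_model // 2, kernel_size=8, stride=8), nn.GELU(),
            nn.Conv1d(d_model // 2, d_model, kernel_size=8, stride=8), nn.GELU(),
        )
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)  # logits over the quantized next sample

    def forward(self, wav):  # wav: (batch, 1, T), T a multiple of 64
        z = self.frontend(wav).transpose(1, 2)  # (batch, frames, d_model)
        n = z.size(1)
        # Causal mask so each latent frame attends only to past frames.
        causal = torch.full((n, n), float("-inf"), device=wav.device).triu(1)
        h = self.encoder(z, mask=causal)
        return self.head(h)  # (batch, frames, n_classes)

# Usage: predictions over a long context (here 64 * 8192 = 524,288 samples).
model = CNNTransformerAR()
logits = model(torch.randn(1, 1, 64 * 8192))
```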