Developing architectures suitable for modeling raw audio is a challenging problem due to the high sampling rates of audio waveforms. Standard sequence modeling approaches like RNNs and CNNs have previously been tailored to fit the demands of audio, but the resulting architectures make undesirable computational tradeoffs and struggle to model waveforms effectively. We propose SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling. We identify that S4 can be unstable during autoregressive generation, and provide a simple improvement to its parameterization by drawing connections to Hurwitz matrices. SaShiMi yields state-of-the-art performance for unconditional waveform generation in the autoregressive setting. Additionally, SaShiMi improves non-autoregressive generation performance when used as the backbone architecture for a diffusion model. Compared to prior architectures in the autoregressive generation setting, SaShiMi generates piano and speech waveforms that humans find more musical and more coherent, respectively; for example, it achieves 2x better mean opinion scores than WaveNet on an unconditional speech generation task. On a music generation task, SaShiMi outperforms WaveNet on density estimation and on speed at both training and inference, even when using 3x fewer parameters. Code can be found at https://github.com/HazyResearch/state-spaces and samples at https://hazyresearch.stanford.edu/sashimi-examples.
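The abstract does not spell out the parameterization change, so the following is only a minimal sketch of what a Hurwitz condition means for a state matrix, not the authors' implementation. A matrix is Hurwitz when every eigenvalue has a strictly negative real part, which keeps the underlying linear state-space dynamics from blowing up. The sketch assumes a diagonal-plus-rank-one state matrix; the helper names `is_hurwitz` and `build_state_matrix` and all numeric values are hypothetical and chosen for illustration.

```python
import numpy as np


def is_hurwitz(A: np.ndarray) -> bool:
    """Return True if every eigenvalue of A has a strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))


def build_state_matrix(lam: np.ndarray, p: np.ndarray, sign: float) -> np.ndarray:
    """Build A = diag(lam) + sign * p p^* (illustrative form, an assumption)."""
    return np.diag(lam) + sign * np.outer(p, p.conj())


n = 8
# Diagonal part with negative real parts (assumed values for illustration).
lam = -0.5 + 1j * np.arange(n)
p = np.ones(n, dtype=complex)

# Adding the rank-one term can push eigenvalues into the right half-plane:
# the trace has real part -0.5 * n + n > 0, so some eigenvalue is unstable.
A_free = build_state_matrix(lam, p, sign=+1.0)

# Subtracting p p^* keeps the matrix Hurwitz: for any unit eigenvector v,
# Re(v^* A v) = v^* Re(diag(lam)) v - |p^* v|^2 < 0.
A_constrained = build_state_matrix(lam, p, sign=-1.0)

print("unconstrained is Hurwitz:", is_hurwitz(A_free))
print("constrained is Hurwitz:  ", is_hurwitz(A_constrained))
```

In this toy setting the subtracted rank-one term guarantees stability for any choice of `p`, which is the kind of property a parameterization can enforce by construction rather than hope to learn.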