Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of these domain-specific sounds, generating general sounds (such as sirens and gunshots) has received less attention, despite their wide range of applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds, as it back-propagates through only a limited number of samples. In this work, we propose a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes. This offers an advantage in efficiently modelling long-range dependencies while retaining local fine-grained structures within sound clips. We evaluate our approach on the UrbanSound8K dataset against SampleRNN, using performance metrics that measure the quality and diversity of the generated sounds. Experimental results show that our method offers comparable performance in quality and significantly better performance in diversity.
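To make the core idea of discrete time-frequency representation learning concrete, the sketch below shows a minimal vector-quantization bottleneck of the kind popularized by VQ-VAE: each encoder output vector computed from a time-frequency representation is snapped to its nearest codebook entry, yielding a compact sequence of discrete tokens over which long-range structure can be modelled. This is a hedged illustration of the general technique only; the class names, dimensions, and layout are assumptions for illustration and do not reproduce the exact architecture evaluated in the paper.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Minimal VQ bottleneck (illustrative sketch, not the paper's exact model).

    Maps each encoder feature vector to the index of its nearest codebook entry,
    producing a discrete token sequence over spectrogram frames.
    """

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, code_dim) encoder output over a time-frequency representation
        flat = z_e.reshape(-1, z_e.size(-1))                      # (batch*frames, code_dim)
        distances = torch.cdist(flat, self.codebook.weight)       # (batch*frames, num_codes)
        indices = distances.argmin(dim=-1).view(z_e.shape[:-1])   # discrete token per frame
        z_q = self.codebook(indices)                              # quantized vectors
        # Straight-through estimator: copy gradients from z_q back to the encoder output
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices


# Example: quantize 2 clips of 100 spectrogram frames with 64-dim encoder features
vq = VectorQuantizer()
z_e = torch.randn(2, 100, 64)
z_q, tokens = vq(z_e)
print(z_q.shape, tokens.shape)  # torch.Size([2, 100, 64]) torch.Size([2, 100])
```

Working over these discrete frame-level tokens, rather than raw waveform samples, is what allows an autoregressive prior (conditioned on the sound class) to capture long-range dependencies efficiently while the decoder restores local fine-grained detail.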