The high temporal resolution of audio and our perceptual sensitivity to small irregularities in waveforms make synthesizing at high sampling rates a complex and computationally intensive task, prohibiting real-time, controllable synthesis within many approaches. In this work we aim to shed light on the potential of Conditional Implicit Neural Representations (CINRs) as lightweight backbones in generative frameworks for audio synthesis. Our experiments show that small Periodic Conditional INRs (PCINRs) learn faster and generally produce quantitatively better audio reconstructions than Transposed Convolutional Neural Networks with equal parameter counts. However, their performance is very sensitive to activation scaling hyperparameters. When learning to represent more uniform sets, PCINRs tend to introduce artificial high-frequency components in reconstructions. We validate that this noise can be minimized by applying standard weight regularization during training or by decreasing the compositional depth of PCINRs, and suggest directions for future research.
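To make the key ingredients concrete, the sketch below shows a minimal periodic (SIREN-style) INR in PyTorch that maps time coordinates to waveform samples. It is not the paper's implementation; the layer and class names, the activation scale w0 = 30, the hidden width and depth, and the use of AdamW weight decay as the "standard weight regularization" are illustrative assumptions, and the conditioning mechanism of a full PCINR is omitted.

```python
# Illustrative sketch only: a small periodic INR with an explicit activation
# scaling hyperparameter (w0) and L2 weight regularization via the optimizer.
import math
import torch
import torch.nn as nn


class PeriodicLayer(nn.Module):
    """Linear layer followed by a scaled sine activation: sin(w0 * (Wx + b))."""

    def __init__(self, in_features, out_features, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_features, out_features)
        # SIREN-style initialization: the first layer uses 1/in_features,
        # later layers use sqrt(6 / in_features) / w0.
        with torch.no_grad():
            bound = (1.0 / in_features) if is_first else math.sqrt(6.0 / in_features) / w0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))


class PeriodicINR(nn.Module):
    """Small periodic INR mapping time coordinates t -> audio sample values."""

    def __init__(self, hidden=256, depth=3, w0=30.0):
        super().__init__()
        # "Compositional depth" here is the number of stacked periodic layers.
        layers = [PeriodicLayer(1, hidden, w0=w0, is_first=True)]
        layers += [PeriodicLayer(hidden, hidden, w0=w0) for _ in range(depth - 1)]
        self.net = nn.Sequential(*layers)
        self.out = nn.Linear(hidden, 1)

    def forward(self, t):
        return self.out(self.net(t))


# Standard weight regularization expressed as plain weight decay.
model = PeriodicINR(hidden=256, depth=3, w0=30.0)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```

Under these assumptions, lowering `depth` or raising `weight_decay` are the two knobs that correspond to the noise-reduction strategies described above, while `w0` is the activation scaling hyperparameter to which reconstruction quality is reported to be sensitive.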