Deep learning approaches for black-box modelling of audio effects have shown promise; however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. To address this, we propose the integration of time-varying feature-wise linear modulation into existing temporal convolutional backbones, an approach that enables learnable adaptation of the intermediate activations. We demonstrate that our approach more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
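To make the proposed conditioning concrete, the sketch below shows one possible PyTorch realisation of a dilated causal convolutional block whose intermediate activations are modulated by time-varying feature-wise linear modulation: activations are pooled into fixed-length blocks, a recurrent network runs over the block sequence, and its output supplies a per-block, per-channel scale and shift. This is a minimal illustration, not the paper's exact architecture; the class names, layer sizes, and block length are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFiLM(nn.Module):
    """Time-varying FiLM (illustrative): pool activations into fixed-length
    blocks, run an LSTM over the block sequence, and apply a per-block,
    per-channel scale (gamma) and shift (beta)."""
    def __init__(self, num_channels: int, block_size: int = 128):
        super().__init__()
        self.block_size = block_size
        self.pool = nn.MaxPool1d(block_size)
        self.rnn = nn.LSTM(num_channels, 2 * num_channels, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); time assumed divisible by block_size
        pooled = self.pool(x)                         # (b, c, n_blocks)
        params, _ = self.rnn(pooled.transpose(1, 2))  # (b, n_blocks, 2c)
        gamma, beta = params.chunk(2, dim=-1)         # each (b, n_blocks, c)
        # Upsample block-rate parameters back to sample rate.
        gamma = gamma.transpose(1, 2).repeat_interleave(self.block_size, dim=-1)
        beta = beta.transpose(1, 2).repeat_interleave(self.block_size, dim=-1)
        return gamma * x + beta

class TCNBlock(nn.Module):
    """Dilated causal convolution whose activations pass through TFiLM
    before the nonlinearity, plus a 1x1 residual connection."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation       # causal left-padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.tfilm = TFiLM(out_ch)
        self.act = nn.PReLU()
        self.res = nn.Conv1d(in_ch, out_ch, 1)        # match residual channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(F.pad(x, (self.pad, 0)))
        y = self.act(self.tfilm(y))
        return y + self.res(x)

# Example: one block applied to a batch of 1024-sample mono signals.
block = TCNBlock(in_ch=1, out_ch=16, kernel_size=3, dilation=2)
out = block(torch.randn(4, 1, 1024))                  # -> (4, 16, 1024)
```

A design point this sketch highlights: because the recurrence operates at block rate rather than sample rate, the modulation can track long-range signal state (as needed for compression or fuzz) at a small fraction of the cost of running a recurrent network over every sample.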