We present a framework that can impose the audio effects and production style from one recording onto another by example, with the goal of simplifying the audio production process. We train a deep neural network to analyze an input recording and a style reference recording, and to predict the control parameters of the audio effects used to render the output. In contrast to past work, we integrate audio effects as differentiable operators in our framework, perform backpropagation through the audio effects, and optimize end-to-end using an audio-domain loss. We use a self-supervised training strategy that enables automatic control of audio effects without any labeled or paired training data. We survey a range of existing and new approaches for differentiable signal processing, showing how each can be integrated into our framework and discussing their trade-offs. We evaluate our approach on both speech and music tasks, demonstrating that it generalizes to unseen recordings and even to sample rates different from those seen during training. Our approach produces convincing production style transfer results, transforming input recordings into produced recordings while yielding audio effect control parameters that enable interpretability and user interaction.
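To make the core idea concrete, the following is a minimal sketch (assuming PyTorch) of the training loop described above: an encoder predicts effect control parameters from an input and a style reference, the audio is rendered through a differentiable effect, and an audio-domain loss is backpropagated through the effect. The module names (`ParamEncoder`, `DiffGain`), the toy gain-only effect, the crude energy features, and the L1 waveform loss are illustrative assumptions, not the paper's actual architecture, effects, or loss.

```python
# Minimal sketch (PyTorch assumed). ParamEncoder and DiffGain are hypothetical
# stand-ins; a real system would use richer differentiable effects (e.g., EQ,
# compression) and a spectral audio-domain loss.
import torch
import torch.nn as nn


class DiffGain(nn.Module):
    """A trivially differentiable 'audio effect': broadband gain in dB."""
    def forward(self, x, params):
        gain_db = params[..., 0:1]            # (batch, 1) predicted control parameter
        return x * 10.0 ** (gain_db / 20.0)   # gradients flow through the effect


class ParamEncoder(nn.Module):
    """Analyzes input and reference audio, predicts effect control parameters."""
    def __init__(self, n_params=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, n_params))

    def forward(self, x, ref):
        # Crude per-clip energy features; a real encoder would use a deep audio encoder.
        feats = torch.stack([x.pow(2).mean(-1), ref.pow(2).mean(-1)], dim=-1)
        return self.net(feats)


encoder, effect = ParamEncoder(), DiffGain()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Self-supervised pair: the "reference" is the input rendered with a random effect setting,
# so no labeled or manually paired data is required.
x = torch.randn(4, 16000)                       # batch of 1-second clips at 16 kHz
true_gain = torch.empty(4, 1).uniform_(-12, 12)
ref = effect(x, true_gain)

opt.zero_grad()
params = encoder(x, ref)                        # predict control parameters
y = effect(x, params)                           # render output through the differentiable effect
loss = torch.nn.functional.l1_loss(y, ref)      # audio-domain loss, backprop through the effect
loss.backward()
opt.step()
```

Because the effect is expressed as a differentiable operator, the gradient of the audio-domain loss reaches the encoder's weights through the rendering step itself, which is what allows end-to-end optimization of parameter prediction without parameter labels.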