This paper presents CQT-Diff, a data-driven generative audio model that, once trained, can be used to solve a variety of audio inverse problems in a problem-agnostic setting. CQT-Diff is a neural diffusion model whose architecture is carefully constructed to exploit pitch-equivariant symmetries in music. This is achieved by preconditioning the model with an invertible Constant-Q Transform (CQT), whose logarithmically spaced frequency axis turns pitch equivariance into translation equivariance. The proposed method is evaluated with objective and subjective metrics on three distinct tasks: audio bandwidth extension, inpainting, and declipping. The results show that CQT-Diff outperforms the compared baselines and ablations in audio bandwidth extension and, without retraining, delivers competitive performance against modern baselines in audio inpainting and declipping. This work represents the first diffusion-based general framework for solving inverse problems in audio processing.
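The key symmetry argument above can be sketched numerically. The snippet below is not the paper's code; it is a minimal illustration, with assumed parameters (`f_min`, bins per octave `B`), of why a pitch shift becomes a pure translation along a logarithmically spaced frequency axis such as the CQT's.

```python
import numpy as np

# Illustrative CQT-style frequency axis (parameters are assumptions,
# not taken from the paper): log-spaced center frequencies with B bins
# per octave starting from f_min.
f_min = 32.7                      # lowest analysis frequency in Hz (illustrative)
B = 12                            # bins per octave
k = np.arange(48)                 # bin indices spanning 4 octaves
freqs = f_min * 2.0 ** (k / B)    # logarithmically spaced center frequencies

# A pitch shift of s semitone-like steps multiplies every frequency by
# 2**(s/B); on the log-spaced axis this is exactly a shift by s bins,
# i.e. pitch equivariance appears as translation equivariance.
s = 7
shifted = freqs * 2.0 ** (s / B)
assert np.allclose(shifted[:-s], freqs[s:])   # translation by s bins
```

On a linear frequency axis (as in an ordinary STFT), the same pitch shift would instead stretch the axis, which is why the CQT preconditioning is needed to expose the symmetry to a translation-equivariant network.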