This paper presents CQT-Diff, a data-driven generative audio model that, once trained, can solve various audio inverse problems in a problem-agnostic setting. CQT-Diff is a neural diffusion model whose architecture is carefully constructed to exploit pitch-equivariant symmetries in music. This is achieved by preconditioning the model with an invertible Constant-Q Transform (CQT), whose logarithmically spaced frequency axis turns pitch equivariance into translation equivariance. The proposed method is evaluated with objective and subjective metrics on three distinct tasks: audio bandwidth extension, inpainting, and declipping. The results show that CQT-Diff outperforms the compared baselines and ablations in audio bandwidth extension and, without retraining, delivers competitive performance against modern baselines in audio inpainting and declipping. This work represents the first diffusion-based general framework for solving inverse problems in audio processing.
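The key geometric observation behind the architecture can be verified with a short sketch (not the paper's code; `B` and `f_min` are illustrative choices): with geometrically spaced CQT centre frequencies, a pitch shift, which multiplies every frequency by a constant, becomes a pure translation along the bin axis.

```python
import numpy as np

# With B bins per octave, the CQT centre frequencies are
# f_k = f_min * 2**(k / B). A pitch shift of s semitones multiplies
# every frequency by 2**(s / 12), i.e. a shift of s * B / 12 bins.
B = 12                # bins per octave (hypothetical choice)
f_min = 32.7          # lowest centre frequency in Hz (C1, illustrative)
k = np.arange(8 * B)  # 8 octaves of bins
freqs = f_min * 2.0 ** (k / B)

semitones = 7         # transpose up a perfect fifth
shifted = freqs * 2.0 ** (semitones / 12)

# On the log-frequency axis the transposition is a constant bin offset:
bin_offset = semitones * B // 12   # = 7 here
assert np.allclose(shifted[: len(freqs) - bin_offset],
                   freqs[bin_offset:])
```

This is why a translation-equivariant network (e.g. a CNN) operating on the CQT representation inherits pitch equivariance for free.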