Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension

Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks. In this work, we build on these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec. The disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, which highlights spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.

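The abstract does not say which Harmonic-Percussive decomposition is used. As a rough illustration only, the sketch below assumes the common median-filtering approach: harmonic content forms horizontal ridges in a magnitude spectrogram (stable over time), while percussive content forms vertical ridges (broadband, short-lived). All function names and parameters (`stft_magnitude`, `median_filter_1d`, `hp_soft_masks`, the kernel size of 17) are hypothetical choices for this example, not taken from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def stft_magnitude(x, n_fft=512, hop=128):
    """Magnitude spectrogram of a 1-D signal, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    frames = sliding_window_view(x, n_fft)[::hop] * window
    return np.abs(np.fft.rfft(frames, axis=1)).T


def median_filter_1d(a, size, axis):
    """Running median along one axis (odd size), edge-padded to keep the shape."""
    pad = [(0, 0)] * a.ndim
    pad[axis] = (size // 2, size // 2)
    padded = np.pad(a, pad, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)


def hp_soft_masks(mag, size=17, eps=1e-10):
    """Soft harmonic and percussive masks from a magnitude spectrogram."""
    harm = median_filter_1d(mag, size, axis=1)  # median across time frames
    perc = median_filter_1d(mag, size, axis=0)  # median across frequency bins
    total = harm**2 + perc**2 + eps
    return harm**2 / total, perc**2 / total


# Toy input (assumed for illustration): a steady 1 kHz tone plus a single click.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 1000 * t)
signal[8000] += 50.0  # the click

mag = stft_magnitude(signal)
mask_h, mask_p = hp_soft_masks(mag)
harmonic_mag = mask_h * mag    # energy attributed to the tone
percussive_mag = mask_p * mag  # energy attributed to the click
```

In a codec setting, the two masked spectrograms (or signals resynthesized from them) could serve as targets that steer separate latent branches. How the paper actually couples the decomposition to the codec is not specified in the abstract.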