Neural audio codecs have been widely studied for mono and stereo signals, but spatial audio remains largely unexplored. We present the first discrete neural spatial audio codec for first-order ambisonics (FOA). Building on the WavTokenizer architecture, we extend it to four-channel FOA signals and introduce a novel spatial consistency loss that preserves directional cues in the reconstructed signals under a highly compressed representation. Our codec compresses four-channel FOA audio sampled at 24 kHz into 75 discrete tokens per second, corresponding to a bit rate of 0.9 kbps. Evaluations on simulated reverberant mixtures, non-reverberant clean speech, and FOA mixtures with real room impulse responses show accurate reconstruction, with mean angular errors of 13.76°, 3.96°, and 25.83°, respectively. In addition, discrete latent representations derived from our codec provide useful features for downstream spatial audio tasks, as demonstrated on sound event localization and detection with real recordings from STARSS23.
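As a rough check on the figures above, the following Python sketch relates the token rate to the bit rate and shows one common way to compute an angular error between two direction vectors. It rests on assumptions that the abstract does not state: a single codebook of 4096 entries (12 bits per token, chosen because 0.9 kbps divided by 75 tokens per second gives 12 bits), and a plain arccosine of the normalized dot product as the angular error. The function names are hypothetical, and the evaluation protocol actually used for the reported errors may differ.

```python
import math


def bitrate_kbps(tokens_per_second, codebook_size):
    """Bit rate of a single-codebook token stream, in kbps."""
    # Assumption: one codebook, each token costs log2(codebook_size) bits.
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000.0


def angular_error_deg(u, v):
    """Angle in degrees between two non-zero 3-D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# 75 tokens/s with an assumed 4096-entry codebook.
rate = bitrate_kbps(75, 4096)

# Illustrative vectors only: a source on the x-axis estimated on the y-axis.
error = angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

Under these assumptions, `rate` works out to 75 × 12 / 1000 = 0.9 kbps, matching the stated bit rate, and `error` is a quarter turn for the two orthogonal example vectors.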