Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so a separate model must be trained for each target bitrate. This increases the memory footprint at both the sender and the receiver, and transcoding is often needed to support multiple receivers. In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement. In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and the quality improves progressively as more bits become available. The proposed CSVQ scheme can be flexibly applied to any neural audio coding network with a mirrored auto-encoder structure to achieve bitrate scalability. Subjective results show that the proposed scheme outperforms classical residual VQ (RVQ) with scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps and Lyra at 3 kbps, and it can provide a graceful quality boost as the bitrate increases.
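To make the notion of bitrate scalability concrete, the following is a minimal sketch of the classical residual VQ (RVQ) baseline mentioned above, not of CSVQ itself. Each stage quantizes the residual left by the previous stages, so a receiver that gets only the first k indices can still decode a coarse approximation. The codebooks here are random rather than learned, and the all-zero codeword in each codebook is an illustrative assumption that guarantees a stage never increases the error; all function names and sizes are hypothetical.

```python
import random


def make_codebooks(num_stages, codebook_size, dim, seed=0):
    """Build random codebooks (assumption: a real codec learns these)."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage  # later stages model smaller residuals
        cb = [[rng.gauss(0.0, scale) for _ in range(dim)]
              for _ in range(codebook_size)]
        cb[0] = [0.0] * dim  # zero codeword: a stage can never add error
        codebooks.append(cb)
    return codebooks


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rvq_encode(x, codebooks):
    """Greedy RVQ: pick the nearest codeword to the residual at each stage."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: sq_dist(residual, cb[i]))
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of indices gives a coarse output."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


codebooks = make_codebooks(num_stages=4, codebook_size=64, dim=8)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
indices = rvq_encode(x, codebooks)

# Decode from progressively longer prefixes of the index stream.
errors = [sq_dist(x, rvq_decode(indices[:k], codebooks)) ** 0.5
          for k in range(1, len(indices) + 1)]
for k, err in enumerate(errors, start=1):
    print(f"stages received: {k}, reconstruction error: {err:.4f}")
```

With 64 entries per codebook, each stage costs 6 bits per vector, so truncating the index stream lowers the bitrate in 6-bit steps while the reconstruction error stays the same or grows.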