We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low-latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
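To make the residual vector quantization and the structured quantizer dropout mentioned above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each stage quantizes the residual left by the previous stage, and limiting the number of active stages at training time lets one model serve several bitrates. All names, shapes, and hyperparameters here (e.g., `num_quantizers`, `codebook_size`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ResidualVectorQuantizer(nn.Module):
    """Illustrative sketch of residual vector quantization (RVQ).

    Each codebook quantizes the residual error of the previous stage,
    so summing the selected codewords refines the approximation of the
    encoder embedding stage by stage.
    """

    def __init__(self, num_quantizers: int = 8, codebook_size: int = 1024, dim: int = 128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, z: torch.Tensor, num_active: int | None = None) -> torch.Tensor:
        # z: (batch, dim) encoder embeddings for one frame.
        if num_active is None:
            num_active = len(self.codebooks)
        residual = z
        quantized = torch.zeros_like(z)
        for codebook in self.codebooks[:num_active]:
            # Nearest-neighbour lookup against the current residual.
            distances = torch.cdist(residual, codebook.weight)
            indices = distances.argmin(dim=-1)
            selected = codebook(indices)
            quantized = quantized + selected
            residual = residual - selected
        return quantized


# "Structured dropout" over quantizer layers (assumption about the exact
# sampling scheme): draw the number of active stages per training batch so
# the same model is exposed to the full range of bitrates.
rvq = ResidualVectorQuantizer()
z = torch.randn(4, 128)
num_active = torch.randint(1, len(rvq.codebooks) + 1, (1,)).item()
z_quantized = rvq(z, num_active=num_active)
```

At inference time, using fewer codebooks simply lowers the bitrate (fewer indices to transmit per frame) at the cost of a coarser reconstruction, which is how a single trained model covers the 3 kbps to 18 kbps range described above.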