We introduce LMCodec, a causal neural speech codec that provides high-quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, so that fewer codes need to be transmitted. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test shows that the quality is comparable to that of reference codecs operating at higher bitrates. Example audio is available at https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec.
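To make the coarse-to-fine token hierarchy concrete, the following is a minimal sketch of residual vector quantization in NumPy. It is not LMCodec's implementation: the codebooks here are random placeholders (a real codec learns them and applies them to encoder outputs), the names `rvq_encode` and `rvq_decode` are hypothetical, and the all-zero codeword added to each codebook is an assumption made only so that the reconstruction error provably never increases from one level to the next.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x level by level; return one codeword index per level."""
    residual = x.astype(np.float64).copy()
    indices = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # The next level only sees what this level failed to capture.
        residual = residual - cb[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected codewords; passing fewer indices gives a coarser result."""
    out = np.zeros(codebooks[0].shape[1])
    for idx, cb in zip(indices, codebooks):
        out += cb[idx]
    return out


# Illustrative sizes, not taken from the paper.
rng = np.random.default_rng(0)
dim, num_levels, codebook_size = 8, 4, 16

codebooks = []
for level in range(num_levels):
    # Placeholder codebooks whose scale shrinks at finer levels.
    cb = rng.normal(scale=0.5 ** level, size=(codebook_size, dim))
    cb[0] = 0.0  # assumption: a zero codeword, so a level can never hurt
    codebooks.append(cb)

x = rng.normal(size=dim)
tokens = rvq_encode(x, codebooks)

# Reconstruction error when only the first k (coarsest) tokens are used.
errors = [
    float(np.linalg.norm(x - rvq_decode(tokens[:k], codebooks)))
    for k in range(1, num_levels + 1)
]
print("tokens:", tokens)
print("error by number of levels:", errors)
```

Decoding from only the first few tokens corresponds to transmitting the coarse levels alone. In the system described above, the remaining fine tokens would be predicted by the language model rather than sent, which this sketch does not attempt to show.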