Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network-based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs can synthesize high-fidelity speech, their reliance on recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs by using pretrained Transformers, whose inductive bias allows them to exploit long-range dependencies in the input signal. Specifically, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial network (GAN) decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when both are trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable to or better than that of conventional codecs operating at three to four times the rate.
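To make the bitrate figures concrete, the following minimal sketch shows how the bitrate of a fixed-rate quantizer is commonly computed and what "three to four times the rate" means relative to $600\,\mathrm{bps}$. The frame rate, number of codebooks, and codebook size are hypothetical values chosen only so that the arithmetic lands on $600\,\mathrm{bps}$; they are not taken from the abstract, and `quantizer_bitrate` is an illustrative helper rather than part of any codec.

```python
import math


def quantizer_bitrate(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second for a fixed-rate quantizer that emits one index
    per codebook for every frame."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame


# Hypothetical configuration (an assumption, NOT from the abstract):
# 50 frames per second, 2 codebooks of 64 entries each (6 bits per index).
neural_bps = quantizer_bitrate(frame_rate_hz=50, num_codebooks=2, codebook_size=64)

# "Three to four times the rate" of a 600 bps codec, as stated in the abstract.
conventional_bps_range = (3 * neural_bps, 4 * neural_bps)

print(f"neural codec: {neural_bps:.0f} bps")
print(f"conventional codecs: {conventional_bps_range[0]:.0f} to "
      f"{conventional_bps_range[1]:.0f} bps")
```

Under these assumed settings each frame costs 12 bits, so the comparison in the last sentence of the abstract corresponds to conventional codecs operating at roughly 1800 to 2400 bps.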