Neural audio/speech coding has recently demonstrated its ability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or blind features learned with a convolutional neural network for encoding, so temporal redundancies remain within the encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with a latency of 40 ms, the proposed TF-Codec at 1 kbps achieves much better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
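To illustrate the kind of differentiable quantization the abstract refers to, the following is a minimal NumPy sketch of a distance-to-soft mapping combined with Gumbel-Softmax sampling; it is an assumption-laden illustration, not the paper's actual implementation, and all function and variable names here (`gumbel_softmax_vq`, `codebook`, `tau`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_vq(x, codebook, tau=1.0, rng=rng):
    """Soft vector quantization sketch: map negative squared distances
    to logits, then sample a soft one-hot assignment via Gumbel-Softmax.
    x: (frames, dim) latent vectors; codebook: (num_codes, dim)."""
    # squared Euclidean distance from each frame to each codeword
    d = np.sum((x[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    logits = -d  # distance-to-soft mapping: closer codewords get higher logits
    # Gumbel noise makes the (relaxed) categorical sampling differentiable
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    w = np.exp(y - y.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # soft one-hot weights, rows sum to 1
    # quantized latent = convex combination of codewords
    return w @ codebook, w

x = rng.normal(size=(4, 8))        # 4 latent frames of dimension 8
codebook = rng.normal(size=(16, 8))  # 16 codewords
q, w = gumbel_softmax_vq(x, codebook, tau=0.5)
```

Lowering `tau` sharpens the assignment toward a hard nearest-neighbor choice at inference time, while the soft relaxation keeps gradients flowing to both the encoder and the codebook during training.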