End-to-end neural audio/speech coding has recently shown great potential to outperform traditional codecs based on signal analysis. This is mostly achieved by following the VQ-VAE paradigm, in which blind features are learned, vector-quantized, and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, a more global speaker-identity feature and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among the different features, but also provides the flexibility to edit audio in the embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency, and we find that the learned disentangled features achieve performance on any-to-any voice conversion comparable to modern self-supervised speech representation learning models, with far fewer parameters and lower latency, showing the potential of our neural coding framework.
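To make the two-branch idea concrete, the following is a minimal PyTorch sketch of a codec with a per-frame content branch and a globally pooled speaker-identity branch, each with its own vector quantizer. All module names, dimensions, the mean-pooling choice, and codebook sizes here are illustrative assumptions, not the paper's actual architecture; training losses (reconstruction, VQ commitment) and the disentanglement objective are omitted.

```python
# Illustrative sketch only: a two-branch "content + speaker" codec.
# Shapes, layer choices, and codebook sizes are assumptions for exposition.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbor VQ with a straight-through estimator (VQ-VAE style)."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (..., dim)
        flat = z.reshape(-1, z.shape[-1])              # (N, dim)
        d = torch.cdist(flat, self.codebook.weight)    # (N, num_codes)
        idx = d.argmin(dim=-1)                         # code index per vector
        q = self.codebook(idx).reshape_as(z)
        # Straight-through: gradients pass to the encoder unchanged.
        return z + (q - z).detach(), idx.reshape(z.shape[:-1])


class DisentangledCodec(nn.Module):
    def __init__(self, dim=64, content_codes=256, speaker_codes=64):
        super().__init__()
        # Framewise analysis: 25 ms windows with a 10 ms hop at 16 kHz.
        self.encoder = nn.Conv1d(1, dim, kernel_size=400, stride=160)
        # Local content branch: quantized every frame (most of the bitrate).
        self.content_vq = VectorQuantizer(content_codes, dim)
        # Global speaker branch: quantized once per utterance (few bits).
        self.speaker_vq = VectorQuantizer(speaker_codes, dim)
        self.decoder = nn.ConvTranspose1d(2 * dim, 1, kernel_size=400, stride=160)

    def forward(self, wav):                            # wav: (B, 1, T)
        h = self.encoder(wav)                          # (B, dim, frames)
        content_q, content_idx = self.content_vq(h.transpose(1, 2))
        spk = h.mean(dim=-1)                           # temporal pooling -> global identity
        spk_q, spk_idx = self.speaker_vq(spk)
        # Broadcast the global speaker code over time and decode.
        spk_map = spk_q.unsqueeze(-1).expand(-1, -1, content_q.shape[1])
        z = torch.cat([content_q.transpose(1, 2), spk_map], dim=1)
        return self.decoder(z), (content_idx, spk_idx)


codec = DisentangledCodec()
wav = torch.randn(1, 1, 16000)                         # 1 s of 16 kHz audio
recon, codes = codec(wav)
```

Under this decomposition, the bit-allocation and voice-conversion claims follow naturally: the speaker branch is coded once per utterance rather than per frame, and decoding one utterance's content indices with another utterance's speaker code would perform the embedding-space voice conversion described above.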