We propose a novel neural waveform compression method to catalyze emerging speech semantic communications. By introducing a nonlinear transform and variational modeling, we effectively capture the dependencies within speech frames and estimate the probability distribution of the speech features more accurately, yielding better compression performance. Specifically, the speech signals are analyzed and synthesized by a pair of nonlinear transforms, producing latent features. An entropy model with a hyperprior is built to capture the probability distribution of the latent features, followed by quantization and entropy coding. The proposed waveform codec can be flexibly optimized toward an arbitrary rate; another appealing property is that it can easily be optimized for any differentiable loss function, including the perceptual losses used in semantic communications. To further improve fidelity, we incorporate residual coding to mitigate the degradation arising from quantization distortion in the latent space. Results indicate that, at the same performance, the proposed method saves up to 27% of the coding rate compared with the widely used adaptive multi-rate wideband (AMR-WB) codec as well as emerging neural waveform coding methods.