Traditional low bit-rate speech coding approaches only handle narrowband speech at 8 kHz, which limits further improvements in speech quality. Motivated by recent successful explorations of deep learning methods for image and speech compression, this paper presents a new approach that applies vector quantization (VQ) to mel-frequency cepstral coefficients (MFCCs) and uses a deep generative model called WaveGlow to provide efficient, high-quality speech coding. The coding feature is solely an 80-dimensional MFCC vector for 16 kHz wideband speech; speech coding at bit-rates ranging from 1000 to 2000 bit/s can then be scalably implemented by applying different VQ schemes to the MFCC vector. This new deep-generative-network-based codec runs fast because the WaveGlow model abandons the sample-by-sample autoregressive mechanism. We evaluated the new approach on the multi-speaker TIMIT corpus, and experimental results demonstrate that it provides better speech quality than the state-of-the-art classic MELPe codec at a lower bit-rate.
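To make the bit-rate arithmetic behind "different VQ schemes for the MFCC vector" concrete, the sketch below trains a plain k-means codebook over 80-dimensional feature frames and quantizes each frame to a single index. This is a minimal, hypothetical illustration, not the paper's actual VQ scheme: the feature matrix is random stand-in data, and the codebook size (64 entries, i.e. 6 bits/frame) and frame rate are assumptions chosen only to show how bits-per-frame follows from codebook size.

```python
import numpy as np

def train_codebook(features, num_codes=64, iters=20, seed=0):
    """Plain k-means codebook training (a stand-in for the paper's VQ schemes)."""
    rng = np.random.default_rng(seed)
    # Initialize code vectors from randomly chosen frames.
    codebook = features[rng.choice(len(features), num_codes, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest code vector (Euclidean distance).
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # Move each code vector to the centroid of its assigned frames.
        for k in range(num_codes):
            members = features[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(features, codebook):
    """Return one codebook index per frame."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

# Hypothetical stand-in data: 1000 frames of 80-dim features.
rng = np.random.default_rng(1)
feats = rng.normal(size=(1000, 80)).astype(np.float32)

codebook = train_codebook(feats, num_codes=64)
indices = quantize(feats, codebook)

# With a 64-entry codebook, each frame costs log2(64) = 6 bits;
# at an assumed 80 frames/s that is 480 bit/s per codebook, so the
# 1000-2000 bit/s range can be reached by combining several codebooks
# (e.g. split VQ) or enlarging them.
bits_per_frame = np.log2(len(codebook))
```

In practice, split or multi-stage VQ (several smaller codebooks over sub-vectors or residuals) is what makes such schemes scalable across a range of bit-rates, since each added codebook contributes its own bits per frame.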