Real-time communication over packet-switched networks is now part of everyday life, but it inevitably suffers from network delay and packet loss under real-time constraints. Audio packet loss concealment (PLC) algorithms mitigate these transmission failures by reconstructing the lost information. Because transmission latency and device memory are limited, however, high-quality voice reconstruction from a relatively small packet buffer remains difficult. In this paper, we propose a temporal memory generative adversarial network for audio PLC, dubbed TMGAN-PLC, which comprises a novel nested-UNet generator and time-domain and frequency-domain discriminators. Specifically, the generator combines the nested-UNet with temporal feature-wise linear modulation to finely adjust intra-frame information and establish inter-frame temporal dependencies. To recover the speech content missing after longer loss bursts, we employ multi-stage gated vector quantizers that capture the correct content and reconstruct smooth, near-real audio. Extensive experiments on the PLC Challenge dataset demonstrate that the proposed method yields promising performance in terms of speech quality, intelligibility, and PLCMOS.
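As a rough illustration of the temporal feature-wise linear modulation idea mentioned above, the following minimal sketch scales and shifts each frame's channels using parameters derived from a running memory of earlier frames. It is not the TMGAN-PLC implementation: the function names, the fixed weights `w_state` and `w_in`, and the simple tanh recurrence are all assumptions standing in for learned network components.

```python
import math


def film(frame, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift."""
    return [g * x + b for x, g, b in zip(frame, gamma, beta)]


def temporal_film(frames, w_state=0.5, w_in=0.5):
    """Modulate each frame with parameters derived from a recurrent memory.

    frames: list of frames, each a list of per-channel features.
    w_state, w_in: hypothetical fixed weights (a trained model would
    learn these, typically inside an RNN or GRU cell).
    """
    n_channels = len(frames[0])
    state = [0.0] * n_channels  # temporal memory carried across frames
    modulated = []
    for frame in frames:
        # Toy recurrent update mixing the previous memory with the current frame.
        state = [math.tanh(w_state * s + w_in * x) for s, x in zip(state, frame)]
        # Modulation parameters come from the memory, not from the frame alone.
        gamma = [1.0 + s for s in state]  # scale centred on identity
        beta = [0.1 * s for s in state]   # small shift
        modulated.append(film(frame, gamma, beta))
    return modulated
```

Because the memory is updated frame by frame, the output for a given frame depends on the frames that preceded it, which is the inter-frame dependency the abstract refers to; the per-channel scale and shift correspond to the intra-frame adjustment.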