Frequency-domain processing, and in particular the use of the Modified Discrete Cosine Transform (MDCT), is the most widespread approach to audio coding. However, at low bitrates, audio quality, especially for speech, degrades drastically because too few bits are available to code the transform coefficients directly. Traditionally, post-filtering has been used to mitigate artefacts in the coded speech by exploiting a priori information about the source and extra transmitted parameters. Recently, data-driven post-filters have shown better results, but at the cost of significant additional complexity and delay. In this work, we propose a mask-based post-filter that operates directly in the MDCT domain of the codec and induces no extra delay. The real-valued mask is applied to the quantized MDCT coefficients and is estimated by a relatively lightweight convolutional encoder-decoder network. Our solution is tested on the recently standardized low-delay, low-complexity codec (LC3) at its lowest bitrate of 16 kbps. Objective and subjective assessments clearly show the advantage of this approach over the conventional post-filter, with an average improvement of 10 MUSHRA points over the LC3 coded speech.
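The masking step described above can be illustrated with a minimal sketch. It shows only the element-wise application of a real-valued mask to one frame of quantized MDCT coefficients, processed frame by frame without look-ahead. The names `apply_mask`, `postfilter_stream`, and `toy_mask_estimator` are hypothetical, and the toy estimator is an arbitrary stand-in for the convolutional encoder-decoder network, whose architecture and inputs are not specified here; it also sees only the current frame, which is a simplification assumed for this example.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def apply_mask(quantized_mdct: Frame, mask: Frame) -> List[float]:
    """Scale each decoded MDCT coefficient by its real-valued mask entry."""
    if len(quantized_mdct) != len(mask):
        raise ValueError("mask and coefficient frame must have the same length")
    return [m * c for m, c in zip(mask, quantized_mdct)]


def postfilter_stream(
    frames: Sequence[Frame],
    estimate_mask: Callable[[Frame], Frame],
) -> List[List[float]]:
    """Enhance frames one at a time, with no look-ahead into future frames."""
    return [apply_mask(frame, estimate_mask(frame)) for frame in frames]


def toy_mask_estimator(frame: Frame) -> List[float]:
    # Hypothetical placeholder, NOT the trained network from the paper:
    # apply a gain of 1.5 to small-magnitude bins and leave the rest unchanged.
    return [1.5 if abs(c) < 1.0 else 1.0 for c in frame]


if __name__ == "__main__":
    decoded_frames = [[2.0, -0.5], [0.25, 4.0]]
    print(postfilter_stream(decoded_frames, toy_mask_estimator))
```

Because the mask is multiplicative, a coefficient quantized to zero stays zero under this scheme; the sketch makes no claim about how the actual system treats such bins.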