We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
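The loss balancer idea can be made concrete with a short sketch. Below is a minimal PyTorch illustration of one way to implement it, not the authors' exact code (which additionally smooths gradient norms with an exponential moving average): each loss's gradient with respect to the model output is rescaled so that its norm equals the fraction assigned by its weight, making the weights independent of each loss's natural scale. The function and parameter names (`balance_gradients`, `total_norm`) are illustrative.

```python
import torch

def balance_gradients(losses: dict, weights: dict,
                      model_output: torch.Tensor,
                      total_norm: float = 1.0) -> None:
    """Backpropagate a combination of losses where each weight sets the
    fraction of the overall gradient (w.r.t. the model output) that the
    corresponding loss contributes, regardless of the loss's raw scale.
    """
    grads, norms = {}, {}
    for name, loss in losses.items():
        # Gradient of this loss alone w.r.t. the model output; keep the
        # graph alive so the remaining losses (and the final backward
        # pass) can reuse it.
        g, = torch.autograd.grad(loss, [model_output], retain_graph=True)
        grads[name] = g
        norms[name] = g.norm()
    total_weight = sum(weights[name] for name in losses)
    # Rescale each gradient so its norm matches its assigned fraction of
    # `total_norm`, then backpropagate the combined gradient once through
    # the model.
    combined = sum(
        grads[name]
        * (total_norm * weights[name] / total_weight / (norms[name] + 1e-12))
        for name in losses
    )
    model_output.backward(combined)

# Hypothetical usage: with these weights, the adversarial loss accounts
# for 3/4 of the total gradient norm however large or small its raw
# values happen to be:
# balance_gradients({'recon': l_recon, 'adv': l_adv},
#                   {'recon': 1.0, 'adv': 3.0}, decoder_output)
```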