We present AERO, an audio super-resolution model that processes speech and music signals in the spectral domain. AERO is based on an encoder-decoder architecture with U-Net-like skip connections. We optimize the model using both time-domain and frequency-domain loss functions. Specifically, we combine a set of reconstruction losses with perceptual ones in the form of adversarial and feature discriminator loss functions. To better handle phase information, the proposed method operates over the complex-valued spectrogram using two separate channels. Unlike prior work, which mainly concatenates the low and high frequency bands for audio super-resolution, the proposed method directly predicts the full frequency range. We demonstrate high performance across a wide range of sample rates for both speech and music. AERO outperforms the evaluated baselines on Log-Spectral Distance, ViSQOL, and the subjective MUSHRA test. Audio samples and code are available at https://pages.cs.huji.ac.il/adiyoss-lab/aero
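To make two ideas from the abstract concrete, the following is a minimal sketch, not AERO's implementation: it shows one way a complex-valued spectrogram can be carried as two real-valued channels, and one commonly used definition of Log-Spectral Distance (the abstract gives no formula, so this definition is an assumption). The naive STFT, the frame length, hop size, Hann window, `eps` constant, and all function names are hypothetical choices made for illustration only.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive Hann-windowed STFT; returns frames of complex bins 0..frame_len//2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def to_two_channels(spec):
    """Split a complex spectrogram into separate real and imaginary channels."""
    real = [[z.real for z in frame] for frame in spec]
    imag = [[z.imag for z in frame] for frame in spec]
    return real, imag


def from_two_channels(real, imag):
    """Recombine the two real-valued channels into complex bins."""
    return [[complex(r, i) for r, i in zip(rf, imf)]
            for rf, imf in zip(real, imag)]


def log_spectral_distance(ref, est, eps=1e-10):
    """Assumed LSD definition: per-frame RMS difference of log10 power
    spectra, averaged over frames. eps avoids log(0)."""
    total = 0.0
    for ref_frame, est_frame in zip(ref, est):
        sq = [(math.log10(abs(r) ** 2 + eps) - math.log10(abs(e) ** 2 + eps)) ** 2
              for r, e in zip(ref_frame, est_frame)]
        total += math.sqrt(sum(sq) / len(sq))
    return total / len(ref)


# Toy signal with a low (bin 1) and a high (bin 3) frequency component.
signal = [math.sin(2 * math.pi * n / 8) + 0.5 * math.sin(2 * math.pi * 3 * n / 8)
          for n in range(64)]
spec = stft(signal)

# The two-channel form loses nothing: phase is kept in the real/imag pair.
real, imag = to_two_channels(spec)
rebuilt = from_two_channels(real, imag)

# Simulate a band-limited input by zeroing the upper bins, then compare.
lowpassed = [[z if k < 3 else 0j for k, z in enumerate(frame)] for frame in spec]
lsd_same = log_spectral_distance(spec, spec)
lsd_low = log_spectral_distance(spec, lowpassed)
```

In this sketch the two-channel representation round-trips exactly, and the distance between a spectrogram and itself is zero, while removing the upper bins yields a positive distance. A model that predicts the full frequency range, as the abstract describes, would output both channels for all bins rather than only the missing upper band.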