In this work, we present CleanUNet, a causal speech denoising model that operates directly on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks that refine its bottleneck representations, which is crucial for obtaining good results. The model is optimized with a set of losses defined over both the waveform and multi-resolution spectrograms. The proposed method outperforms state-of-the-art models in denoised speech quality, as measured by various objective and subjective evaluation metrics.
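To make the idea of a loss defined over both the waveform and multi-resolution spectrograms concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the training objective as implemented for CleanUNet: the choice of an L1 waveform term, the spectral-convergence and log-magnitude spectrogram terms, the equal weighting, the Hann window, and the three `(n_fft, hop)` resolutions are all assumptions borrowed from common multi-resolution STFT loss formulations, and the function names are hypothetical.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram of a 1-D signal, Hann-windowed, no padding.

    Assumes len(x) >= n_fft. Returns an array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))


def multi_resolution_stft_loss(clean, denoised, resolutions, eps=1e-8):
    """Average spectrogram loss over several (n_fft, hop) resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        s_clean = stft_magnitude(clean, n_fft, hop)
        s_den = stft_magnitude(denoised, n_fft, hop)
        # Spectral convergence: relative Frobenius-norm error (assumed term).
        sc = np.linalg.norm(s_clean - s_den) / (np.linalg.norm(s_clean) + eps)
        # L1 distance between log-magnitudes (assumed term).
        log_mag = np.mean(np.abs(np.log(s_clean + eps) - np.log(s_den + eps)))
        total += sc + log_mag
    return total / len(resolutions)


def denoising_loss(
    clean,
    denoised,
    resolutions=((512, 128), (1024, 256), (2048, 512)),  # assumed values
):
    """Waveform L1 term plus the multi-resolution spectrogram term."""
    waveform_l1 = np.mean(np.abs(clean - denoised))
    return waveform_l1 + multi_resolution_stft_loss(clean, denoised, resolutions)


if __name__ == "__main__":
    # Toy usage: a 440 Hz tone at 16 kHz, with additive noise standing in
    # for an imperfect model output.
    t = np.arange(4096) / 16000.0
    clean = np.sin(2 * np.pi * 440.0 * t)
    rng = np.random.default_rng(0)
    imperfect = clean + 0.1 * rng.standard_normal(clean.shape)
    print("loss:", denoising_loss(clean, imperfect))
```

Using several resolutions trades off time and frequency detail: short windows penalize errors in transients, while long windows penalize errors in fine harmonic structure. In practice such a loss would be written in a differentiable framework so that gradients can flow back to the model; NumPy is used here only to keep the sketch self-contained.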