Deep generative models for Speech Enhancement (SE) have received increasing attention in recent years. The most prominent examples are Generative Adversarial Networks (GANs), while normalizing flows (NFs) have received less attention despite their potential. Building on previous work, architectural modifications are proposed, along with an investigation of different conditional input representations. Despite being a common choice in related work, Mel-spectrograms prove inadequate for the given scenario. As an alternative, a novel All-Pole Gammatone filterbank (APG) with high temporal resolution is proposed. Although computational evaluation metrics suggest that state-of-the-art GAN-based methods perform best, a perceptual evaluation via a listening test indicates that the presented NF approach (based on the time domain and APG) performs best, especially at lower SNRs. On average, APG outputs are rated as having good quality, a level unmatched by the other methods, including the GAN-based ones.