To address the monaural speech enhancement problem, numerous studies have enhanced speech either in the time domain, operating on an inner domain learned from the speech mixture, or in the time--frequency domain, operating on fixed full-band short-time Fourier transform (STFT) spectrograms. Very recently, a few studies have proposed sub-band based speech enhancement. By operating on sub-band spectrograms, those studies demonstrated competitive performance on the DNS2020 benchmark dataset. Although attractive, this new research direction has not been fully explored, and there is still room for improvement. In this study, we therefore delve into this direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specifically, our proposed PT-FSE model improves on its backbone, a full-band and sub-band fusion model, in three ways. First, we design a frequency transformation module that aims to strengthen the global frequency correlation. Second, we introduce a temporal transformation to capture long-range temporal contexts. Lastly, we propose a novel loss that leverages properties of human auditory perception to encourage the model to focus on low-frequency enhancement. To validate the effectiveness of the proposed model, we conduct extensive experiments on the DNS2020 dataset. Experimental results show that our PT-FSE system not only achieves substantial improvements over its backbone but also outperforms the current state of the art (SOTA) while being 27\% smaller than the SOTA model. With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
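As a rough illustration of what a loss that emphasizes low frequencies can look like, the sketch below weights the squared magnitude error of each STFT bin by a factor that decays with frequency. This is not the loss proposed in the paper, whose exact form is not given here; the function names `low_freq_weights` and `weighted_mag_loss`, the 16 kHz sample rate, and the mel-like weighting curve are all assumptions made for illustration.

```python
import numpy as np


def low_freq_weights(n_bins: int, sample_rate: int = 16000) -> np.ndarray:
    """Hypothetical per-bin weights that decay with frequency.

    The weights follow 1 / (700 + f), which is proportional to the slope of
    the common mel-scale formula, so low-frequency bins receive more weight.
    This curve is an illustrative assumption, not the paper's weighting.
    """
    freqs = np.linspace(0.0, sample_rate / 2, n_bins)
    weights = 1.0 / (700.0 + freqs)
    # Normalise so the average weight is 1 and the loss scale stays comparable
    # to an unweighted mean squared error.
    return weights / weights.mean()


def weighted_mag_loss(est_mag: np.ndarray, ref_mag: np.ndarray,
                      weights: np.ndarray) -> float:
    """Frequency-weighted mean squared error between magnitude spectrograms.

    Both spectrograms have shape (n_bins, n_frames); weights has shape (n_bins,).
    """
    squared_error = (est_mag - ref_mag) ** 2
    # Broadcast the per-bin weights across all time frames.
    return float(np.mean(weights[:, None] * squared_error))
```

Under these assumptions, an error of the same size costs more in a low-frequency bin than in a high-frequency one, which is the behavior the perceptually-motivated loss is described as encouraging.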