For monaural speech enhancement, contextual information is important for accurate speech estimation. However, the commonly used convolutional neural networks (CNNs) are weak at capturing temporal context, since their building blocks process only one local neighborhood at a time. To address this problem, we draw on human auditory perception and introduce a two-stage trainable reasoning mechanism, referred to as the global-local dependency (GLD) block. GLD blocks capture long-term dependencies among time-frequency bins of the noisy spectrogram at both the global and local levels, which helps detect correlations among the speech part, the noise part, and the whole noisy input. Furthermore, we construct a monaural speech enhancement network called GLD-Net, which adopts an encoder-decoder architecture and consists of a speech object branch, an interference branch, and a global noisy branch. The speech features extracted at the global and local levels are efficiently reasoned over and aggregated within each branch. We compare the proposed GLD-Net with existing state-of-the-art methods on the WSJ0 and DEMAND datasets. The results show that GLD-Net outperforms the state-of-the-art methods in terms of PESQ and STOI.
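Since the abstract only outlines the GLD block, the sketch below illustrates one plausible reading of it: a non-local self-attention branch capturing global dependencies among all time-frequency bins, fused with a small convolutional branch for local context. The class name GLDBlock and all layer choices are hypothetical illustrations, not taken from the paper.

```python
# Illustrative sketch only: the abstract does not specify the exact GLD
# formulation, so this assumes a non-local (self-attention) global branch
# fused with a convolutional local branch. All names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLDBlock(nn.Module):
    """Hypothetical global-local dependency block over a T-F feature map."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        # Global branch: 1x1 projections for query/key/value over all T-F bins.
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, inner, kernel_size=1)
        self.proj = nn.Conv2d(inner, channels, kernel_size=1)
        # Local branch: 3x3 convolution over neighboring T-F bins.
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, t*f, inner)
        k = self.key(x).flatten(2)                     # (b, inner, t*f)
        v = self.value(x).flatten(2).transpose(1, 2)   # (b, t*f, inner)
        # Affinity between every pair of T-F bins -> long-term dependency.
        attn = F.softmax(torch.bmm(q, k) / (k.shape[1] ** 0.5), dim=-1)
        g = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, t, f)
        # Fuse global context, local context, and a residual path.
        return x + self.proj(g) + self.local(x)
```

Under this reading, each branch of the network (speech object, interference, global noisy) could stack such blocks so that every time-frequency bin attends to all others while a residual path preserves the local CNN features.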