The complex spectrum and the magnitude are regarded as the two major features for speech enhancement and dereverberation. Traditional approaches treat these two features separately, ignoring their underlying relationship. In this paper, we propose Uformer, a Unet-based dilated complex & real dual-path conformer network that operates in both the complex and magnitude domains for simultaneous speech enhancement and dereverberation. We exploit time attention (TA) and dilated convolution (DC) to leverage local and global contextual information, and frequency attention (FA) to model dimensional information. These three sub-modules, contained in the proposed dilated complex & real dual-path conformer module, effectively improve speech enhancement and dereverberation performance. Furthermore, a hybrid encoder and decoder are adopted to model the complex spectrum and magnitude simultaneously and to promote information interaction between the two domains. Encoder-decoder attention is also applied to enhance the interaction between encoder and decoder. In our experiments, Uformer outperforms all state-of-the-art (SOTA) time-domain and complex-domain models in both objective and subjective evaluations. Specifically, Uformer reaches 3.6032 DNSMOS on the blind test set of the Interspeech 2021 DNS Challenge, surpassing all top-performing models. We also carry out ablation experiments to determine which of the proposed sub-modules are most important.
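The dual-path idea named in the abstract, time attention across frames followed by frequency attention across bins, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a real-valued feature map of shape (T, F, C), uses the features themselves as query, key and value with no learned projections, and omits the dilated convolution, the complex-valued branch, and the encoder-decoder attention. The function names `self_attention` and `dual_path_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention along the second-to-last axis.

    x has shape (..., L, C). Assumption for illustration: x serves as
    query, key and value directly (no learned projection matrices).
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])  # (..., L, L)
    return softmax(scores, axis=-1) @ x                          # (..., L, C)

def dual_path_attention(feat):
    """Apply time attention, then frequency attention, to a (T, F, C) map."""
    # Time attention: for each frequency bin, attend across the T frames.
    ta = self_attention(np.swapaxes(feat, 0, 1))   # (F, T, C)
    feat = feat + np.swapaxes(ta, 0, 1)            # residual, back to (T, F, C)
    # Frequency attention: for each frame, attend across the F bins.
    fa = self_attention(feat)                      # (T, F, C)
    return feat + fa                               # residual connection

rng = np.random.default_rng(0)
feat = rng.standard_normal((10, 8, 4))  # 10 frames, 8 bins, 4 channels
out = dual_path_attention(feat)
print(out.shape)  # (10, 8, 4)
```

Swapping the time and frequency axes before calling the same attention routine is what makes the two paths differ only in which axis acts as the sequence; the output keeps the input shape so the block can be stacked.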