In this paper, we propose a two-stage heterogeneous lightweight network for monaural speech enhancement. Specifically, we design a two-stage framework consisting of a coarse-grained full-band mask estimation stage and a fine-grained low-frequency refinement stage. Instead of a hand-designed real-valued filter bank, we use a novel learnable complex-valued rectangular bandwidth (LCRB) filter bank to extract compact features. Furthermore, to match the respective characteristics of the two stages, we use a heterogeneous structure: a U-shaped subnetwork serves as the backbone of CoarseNet, and a single-scale subnetwork serves as the backbone of FineNet. We evaluate the proposed approach on the VoiceBank + DEMAND and DNS datasets. The experimental results show that the proposed method outperforms current state-of-the-art methods while maintaining a relatively small model size and low computational complexity.
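The data flow of the two stages can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function `two_stage_enhance`, the parameter `low_bins`, the residual form of the refinement, and the toy stand-ins `toy_coarse_net` and `toy_fine_net` are all hypothetical, and the LCRB filter bank is omitted entirely.

```python
import numpy as np

def two_stage_enhance(noisy_spec, coarse_net, fine_net, low_bins):
    """Hypothetical two-stage pipeline on a complex spectrogram.

    noisy_spec: complex array of shape (freq_bins, frames).
    coarse_net: callable mapping magnitudes to a full-band mask.
    fine_net:   callable mapping low-band bins to a correction term.
    low_bins:   number of low-frequency bins refined in stage two (assumed).
    """
    # Stage 1: estimate a coarse mask over the full band and apply it.
    mask = coarse_net(np.abs(noisy_spec))
    coarse = mask * noisy_spec

    # Stage 2: refine only the low-frequency bins.
    # A residual correction is assumed here; the abstract does not specify it.
    refined = coarse.copy()
    refined[:low_bins] = coarse[:low_bins] + fine_net(coarse[:low_bins])
    return refined

# Toy stand-ins for the learned subnetworks (not CoarseNet or FineNet).
def toy_coarse_net(mag):
    return mag / (mag + 1.0)  # mask values in [0, 1)

def toy_fine_net(low_band):
    return np.zeros_like(low_band)  # no-op correction

rng = np.random.default_rng(0)
noisy = rng.standard_normal((257, 10)) + 1j * rng.standard_normal((257, 10))
enhanced = two_stage_enhance(noisy, toy_coarse_net, toy_fine_net, low_bins=64)
```

In this sketch the second stage touches only the first `low_bins` rows, so the bins above that cutoff carry the coarse estimate through unchanged.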