One of the leading single-channel speech separation (SS) models is based on a TasNet with a dual-path segmentation technique, in which the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass shape, named Sandglasset, which advances the state-of-the-art (SOTA) SS performance at a significantly smaller model size and lower computational cost. Moving forward through the blocks of Sandglasset, the temporal granularity of the features gradually becomes coarser until the midpoint of the network is reached, and then becomes successively finer again, returning towards the raw signal level. We also find that residual connections between features of the same granularity are critical for preserving information after it passes through the bottleneck layer. Experiments show that Sandglasset, with only 2.3M parameters, achieves the best results on two benchmark SS datasets, WSJ0-2mix and WSJ0-3mix, improving the SI-SNRi scores by 0.8 dB and 2.4 dB absolute, respectively, compared to the prior SOTA results.
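The sandglass-shaped granularity schedule and the same-granularity residual connections described in the abstract can be illustrated with a minimal sketch. The abstract gives no block count or resampling factor, so `num_blocks=6` and `factor=4` are assumptions chosen for illustration, and the helper names `granularity_schedule` and `skip_pairs` are hypothetical. This shows only the shape of the schedule, not the authors' implementation.

```python
def granularity_schedule(num_blocks=6, factor=4):
    """Return one temporal downsampling factor per block.

    The factor grows (coarser features) until the middle of the network
    and then shrinks again (finer features), giving a sandglass shape.
    num_blocks and factor are illustrative assumptions, not values
    taken from the abstract.
    """
    half = num_blocks // 2
    schedule = []
    for b in range(num_blocks):
        # Depth counts how far the block is from the nearest network end.
        depth = b if b < half else num_blocks - 1 - b
        schedule.append(factor ** depth)
    return schedule


def skip_pairs(schedule):
    """Pair each first-half block with the later block of equal granularity.

    Each pair (src, dst) marks where a residual connection could carry
    features across the coarse bottleneck in the middle.
    """
    n = len(schedule)
    return [
        (b, n - 1 - b)
        for b in range(n // 2)
        if schedule[b] == schedule[n - 1 - b]
    ]


if __name__ == "__main__":
    sched = granularity_schedule()
    print(sched)              # [1, 4, 16, 16, 4, 1]
    print(skip_pairs(sched))  # [(0, 5), (1, 4), (2, 3)]
```

Under these assumed settings, the first and last blocks operate at the finest granularity, the two middle blocks at the coarsest, and each residual connection links a block on the way down with its mirror block on the way up.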