One of the leading single-channel speech separation (SS) models is based on TasNet with a dual-path segmentation technique, where the size of each segment remains unchanged across all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass shape, namely Sandglasset, which advances the state-of-the-art (SOTA) SS performance at a significantly smaller model size and computational cost. Moving forward through the blocks of Sandglasset, the temporal granularity of the features gradually becomes coarser until the middle of the network, and then successively turns finer towards the raw signal level. We also find that residual connections between features of the same granularity are critical for preserving information after passing through the bottleneck layer. Experiments show that our Sandglasset, with only 2.3M parameters, achieves the best results on two benchmark SS datasets -- WSJ0-2mix and WSJ0-3mix -- where the SI-SNRi scores are improved by an absolute 0.6 dB and 2.4 dB, respectively, compared with the prior SOTA results.
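The sandglass schedule described above (granularity coarsening to the midpoint, then refining back, with residual links between stages of equal granularity) can be illustrated with a minimal toy sketch. This is not the paper's architecture -- the real blocks are self-attentive networks operating on segmented features -- only a NumPy illustration of the granularity schedule and same-granularity residuals; the pooling factor and block count are assumptions for the example.

```python
import numpy as np

def downsample(x, factor):
    # Average-pool along time: coarser temporal granularity.
    T = (len(x) // factor) * factor
    return x[:T].reshape(-1, factor).mean(axis=1)

def upsample(x, factor):
    # Nearest-neighbor expansion back to finer granularity.
    return np.repeat(x, factor)

def sandglass(x, n_blocks=4, factor=2):
    # First half of the blocks: granularity becomes coarser.
    # Second half: granularity turns finer again.
    # Residual connections link stages with the same granularity,
    # preserving information across the bottleneck.
    skips = []
    h = x
    for _ in range(n_blocks // 2):
        skips.append(h)
        h = downsample(h, factor)
    for _ in range(n_blocks // 2):
        h = upsample(h, factor)
        h = h + skips.pop()[: len(h)]  # same-granularity residual
    return h

y = sandglass(np.arange(16.0))
print(y.shape)  # output has the raw-signal-level length again
```

Note how each upsampling stage is paired with the skip saved at the matching downsampling stage; without those residuals, fine-grained detail lost at the bottleneck could not be recovered.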