Recently, several very effective neural approaches to single-channel speech separation have been presented in the literature. However, due to the size and complexity of these models, their deployment on low-resource devices, e.g., hearing aids and earphones, remains a challenge, and established solutions are not yet available. Although approaches based on either pruning or compressing neural models have been proposed, designing a model architecture suited to a given application domain often requires heuristic procedures that are not easily ported to different low-resource platforms. Given the modular nature of the well-known Conv-Tasnet speech separation architecture, in this paper we consider three parameters that directly control the overall size of the model, namely: the number of residual blocks, the number of repetitions of the separation blocks, and the number of channels in the depth-wise convolutions, and we experimentally evaluate how they affect speech separation performance. In particular, experiments carried out on Libri2Mix show that the number of dilated 1D-Conv blocks is the most critical parameter, and that the use of extra-dilation in the residual blocks helps reduce the resulting performance drop.
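To make the three size-controlling parameters concrete, the following is a minimal PyTorch sketch of a Conv-Tasnet-style temporal convolutional separator. It uses the usual Conv-TasNet notation (B bottleneck channels, H depth-wise channels, P kernel size, X blocks per repeat, R repeats); the specific values and the extra_dilation argument are illustrative assumptions meant to convey the idea of enlarging dilation factors when X is reduced, not the paper's exact implementation.

import torch
import torch.nn as nn


class DilatedConvBlock(nn.Module):
    """One residual 1-D Conv block: 1x1 conv -> depth-wise dilated conv -> 1x1 conv."""

    def __init__(self, in_channels: int, hidden_channels: int,
                 kernel_size: int, dilation: int):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # keeps the frame count unchanged
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, 1),          # pointwise expansion
            nn.PReLU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size,
                      padding=pad, dilation=dilation,
                      groups=hidden_channels),                   # depth-wise dilated conv
            nn.PReLU(),
            nn.Conv1d(hidden_channels, in_channels, 1),          # pointwise projection back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual connection


def build_separator(B: int = 128,   # bottleneck channels
                    H: int = 256,   # channels in the depth-wise convolutions
                    P: int = 3,     # kernel size of the depth-wise convolutions
                    X: int = 8,     # dilated 1-D Conv blocks per repeat
                    R: int = 3,     # repetitions of the stack of X blocks
                    extra_dilation: int = 0) -> nn.Sequential:
    """Stack R repeats of X residual blocks with dilations 2**(i + extra_dilation).

    extra_dilation is a hypothetical knob sketching the extra-dilation idea:
    it recovers receptive field when the number of blocks X is reduced.
    """
    blocks = []
    for _ in range(R):
        for i in range(X):
            blocks.append(DilatedConvBlock(B, H, P, 2 ** (i + extra_dilation)))
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    sep = build_separator(X=4, extra_dilation=1)   # smaller model, larger dilations
    feats = torch.randn(1, 128, 500)               # (batch, bottleneck channels, frames)
    print(sep(feats).shape)                        # torch.Size([1, 128, 500])

Halving X shrinks the stack (and the parameter count) but also the receptive field; shifting every dilation up by a factor of two, as the extra_dilation knob does here, is one way to trade the two off, which is the effect the experiments on Libri2Mix quantify.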