Speech separation models are used to isolate individual speakers in many speech processing applications. Deep learning models achieve state-of-the-art (SOTA) results on a number of speech separation benchmarks. One such class of models, known as temporal convolutional networks (TCNs), has shown promising results on speech separation tasks. A limitation of these models is that they have a fixed receptive field (RF). Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal. In this work, deformable convolution is proposed as a way to give TCN models dynamic RFs that can adapt to various reverberation times in reverberant speech separation. The proposed models achieve an 11.1 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement over the input signal on the WHAMR benchmark. A relatively small deformable TCN model with 1.3M parameters is also proposed, which gives separation performance comparable to that of larger and more computationally complex models.
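To make the reported metric concrete, the following is a minimal sketch of how an SISDR improvement over the input signal could be computed. It assumes the commonly used definition of SISDR (zero-mean signals, with the estimate projected onto the reference). The function names `si_sdr` and `si_sdr_improvement` and the `eps` stabiliser are illustrative choices, not taken from this work, and the evaluation code behind the reported 11.1 dB figure may differ in detail.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sequences of samples.

    Assumed definition: project the zero-mean estimate onto the zero-mean
    reference and compare the energy of that projection with the residual.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Scaled target: the component of the estimate that lies along the reference.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything that is not explained by the scaled reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual) + eps
    return 10.0 * math.log10(target_energy / residual_energy + eps)


def si_sdr_improvement(estimate, mixture, reference):
    """SISDR of the separated estimate minus SISDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Toy signals (hypothetical values, for illustration only).
    reference = [1.0, -1.0, 1.0, -1.0]
    mixture = [2.0, 0.0, 0.0, -2.0]      # reference plus a strong interferer
    estimate = [1.1, -0.9, 0.9, -1.1]    # reference plus a small residual
    print(f"SISDR improvement: {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. For the toy signals above, the mixture scores about 0 dB and the estimate about 20 dB, so the improvement is about 20 dB.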