Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named \textit{MossFormer} (\textit{Mo}naural \textit{s}peech \textit{s}eparation Trans\textit{Former}). To effectively address the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised, low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental interactions directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms previous models and achieves state-of-the-art results on the WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model achieves the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and is only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix.
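To make the joint attention idea concrete, the following is a minimal sketch (not the authors' implementation) of combining full softmax self-attention within local chunks with a linearised self-attention over the whole sequence. The chunk size, the ReLU feature map, the simple normalisation, and the additive combination of the two branches are all illustrative assumptions.

\begin{verbatim}
# Minimal sketch of joint local/global attention (illustrative only).
import torch
import torch.nn.functional as F

def joint_local_global_attention(q, k, v, chunk_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len assumed divisible by chunk_size."""
    b, t, d = q.shape

    # Local branch: quadratic softmax attention restricted to each chunk.
    ql = q.view(b, t // chunk_size, chunk_size, d)
    kl = k.view(b, t // chunk_size, chunk_size, d)
    vl = v.view(b, t // chunk_size, chunk_size, d)
    scores = torch.einsum('bncd,bnmd->bncm', ql, kl) / d ** 0.5
    local = torch.einsum('bncm,bnmd->bncd', F.softmax(scores, dim=-1), vl)
    local = local.reshape(b, t, d)

    # Global branch: linearised attention over the full sequence.
    # A ReLU feature map gives O(t * d^2) cost instead of O(t^2 * d);
    # dividing by t is a crude normalisation used only for this sketch.
    qg, kg = F.relu(q), F.relu(k)
    kv = torch.einsum('btd,bte->bde', kg, v)       # (b, d, d) sequence summary
    global_out = torch.einsum('btd,bde->bte', qg, kv) / t

    # Combine the two branches (here by simple addition).
    return local + global_out

# Usage example with random features
x = torch.randn(2, 256, 128)
print(joint_local_global_attention(x, x, x).shape)  # torch.Size([2, 256, 128])
\end{verbatim}

The local branch keeps exact attention where fine-grained interactions matter most, while the global branch gives every position a direct, low-cost view of the whole sequence, which is the property the abstract contrasts with the indirect cross-chunk paths of dual-path models.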