Transformers have recently been applied successfully to speech separation, thanks to the strong long-range dependency modeling capacity of the self-attention mechanism. However, Transformers tend to incur heavy run-time costs because of their deep stacks of encoder layers, which hinders deployment on edge devices. A small Transformer model with fewer encoder layers is preferable for computational efficiency, but it is prone to performance degradation. In this paper, an ultra-fast speech separation Transformer model is proposed that achieves both better performance and higher efficiency through teacher-student learning (T-S learning). We introduce layer-wise T-S learning and an objective shifting mechanism to guide the small student model to learn intermediate representations from the large teacher model. Compared with a small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation on the LibriCSS dataset. By utilizing additional unlabeled speech data, our ultra-fast speech separation models achieve more than 10% relative WER reduction.
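The following is a minimal PyTorch sketch of the two ideas named above, layer-wise T-S learning and objective shifting, not the paper's actual implementation. The toy `TinyTransformer` encoder, the MSE-based losses, the student-to-teacher layer mapping `{0: 3, 1: 7}`, and the linear `distill_weight` schedule are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Toy Transformer encoder that also exposes its per-layer hidden states."""
    def __init__(self, num_layers, d_model=64, nhead=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):
        hiddens = []
        for layer in self.layers:
            x = layer(x)
            hiddens.append(x)
        return x, hiddens

def layerwise_ts_loss(student_hiddens, teacher_hiddens, layer_map):
    """Layer-wise T-S loss: match each student layer to its assigned teacher layer."""
    return sum(
        nn.functional.mse_loss(student_hiddens[s], teacher_hiddens[t])
        for s, t in layer_map.items()
    ) / len(layer_map)

def distill_weight(step, total_steps):
    """Objective shifting: start from pure distillation, anneal toward the task loss."""
    return max(0.0, 1.0 - step / total_steps)

# Usage sketch: an 8-layer teacher guides a 2-layer student whose layers 0 and 1
# are aligned with teacher layers 3 and 7 (this mapping is an assumption).
teacher, student = TinyTransformer(num_layers=8), TinyTransformer(num_layers=2)
layer_map = {0: 3, 1: 7}
mixture = torch.randn(4, 100, 64)   # (batch, frames, feature_dim) mixed-speech features
target  = torch.randn(4, 100, 64)   # stand-in separation target

with torch.no_grad():                # the teacher is frozen
    _, t_hiddens = teacher(mixture)
s_out, s_hiddens = student(mixture)

task_loss    = nn.functional.mse_loss(s_out, target)          # stand-in separation loss
distill_loss = layerwise_ts_loss(s_hiddens, t_hiddens, layer_map)
lam          = distill_weight(step=0, total_steps=10_000)
loss         = lam * distill_loss + (1.0 - lam) * task_loss   # shifted objective
loss.backward()
```

Early in training the shifted objective is dominated by the layer-wise distillation term, so the student first imitates the teacher's intermediate representations; as `lam` decays, the separation (task) loss takes over.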