Several trade-offs need to be balanced when employing monaural speech separation (SS) models in conversational automatic speech recognition (ASR) systems. A larger SS model generally achieves better output quality at the expense of higher computational cost. Meanwhile, an SS model that performs better on overlapping speech often produces distorted output for non-overlapping speech. This paper addresses these trade-offs with a sparsely-gated mixture-of-experts (MoE). The sparsely-gated MoE architecture allows the separation models to be enlarged without compromising the run-time efficiency, which also helps achieve a better separation-distortion trade-off. To further reduce the speech distortion without compromising the SS capability, a multi-gate MoE framework is also explored, where different gates handle non-overlapping and overlapping frames differently. ASR experiments are conducted on a simulated dataset to measure both the speech separation accuracy and the speech distortion. Two advanced SS models, based on Conformer and WavLM, are used as baselines. The sparsely-gated MoE models show a superior SS capability with less speech distortion, while only marginally increasing the run-time computational cost. Experimental results on real conversation recordings are also presented, showing MoE's effectiveness in an end-to-end evaluation setting.
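To make the sparsely-gated routing concrete, the following is a minimal PyTorch sketch of a top-k MoE feed-forward layer of the kind described above. The class name SparseMoELayer, the expert structure, and all hyperparameters are illustrative assumptions rather than the paper's actual configuration; for clarity, every expert is evaluated on every frame, whereas an efficient implementation would dispatch only the frames routed to each expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Illustrative sparsely-gated MoE feed-forward layer with top-k routing.

    A shared gating network scores the experts for every frame; only the
    top-k experts receive a nonzero weight, so enlarging num_experts grows
    model capacity without growing the per-frame compute proportionally.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # Gating network: one score per expert for each frame.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an independent position-wise feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model)
        scores = self.gate(x)                                # (B, T, E)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # keep k experts per frame
        weights = F.softmax(topk_scores, dim=-1)             # renormalize over the top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                           # (B, T, k) bool
            if not mask.any():
                continue
            # Gate weight for expert e (zero wherever it was not selected).
            w = (weights * mask).sum(dim=-1, keepdim=True)   # (B, T, 1)
            out = out + w * expert(x)
        return out
```

The multi-gate variant explored in the paper could, under the same assumptions, be sketched by instantiating separate gating networks for non-overlapping and overlapping frames and selecting between them with a frame-level overlap indicator.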