Although automatic speech recognition (ASR) performs well in common non-overlapping environments, sustaining that performance on multi-speaker overlapping speech remains challenging. Recent research has revealed that an ASR model's encoder captures different levels of information at different layers: the lower layers tend to encode more acoustic information, and the upper layers more linguistic information. This inspires us to develop a Sidecar separator that empowers a well-trained ASR model for multi-speaker scenarios by separating the mixed speech embedding between two suitable layers. We experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By freezing the parameters of the original model and training only the Sidecar (8.7M, 8.4% of all parameters), the proposed approach outperforms the previous state-of-the-art by a large margin on the 2-speaker-mixed LibriMix dataset, reaching a word error rate (WER) of 10.36%, and obtains comparable results (7.56%) on the LibriSpeechMix dataset with limited training.
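To make the architecture concrete, here is a minimal PyTorch sketch of the idea described above: a lightweight Sidecar separator mounted between a frozen encoder's lower (more acoustic) and upper (more linguistic) layers, which turns one mixed embedding stream into one stream per speaker. The convolutional mask design, the `Sidecar` class name, the split point `split_at`, and the stand-in transformer layers are all illustrative assumptions, not the authors' implementation; only the freeze-the-base-model, train-only-the-Sidecar recipe comes from the abstract.

```python
# A minimal sketch (not the authors' code) of the Sidecar idea.
import torch
import torch.nn as nn


class Sidecar(nn.Module):
    """Lightweight separator mapping one mixed embedding stream to one
    masked stream per speaker (hypothetical convolutional design)."""

    def __init__(self, dim: int, num_speakers: int = 2, hidden: int = 256):
        super().__init__()
        self.num_speakers = num_speakers
        # Temporal convolutions estimate one soft mask per speaker.
        self.mask_net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, dim * num_speakers, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) mixed-speech embedding from a lower layer.
        masks = self.mask_net(x.transpose(1, 2))            # (B, dim*S, T)
        masks = masks.view(x.size(0), self.num_speakers, x.size(2), x.size(1))
        # Apply each speaker's mask; stack speakers into the batch dimension
        # so the frozen upper layers process every stream independently.
        separated = masks.transpose(2, 3) * x.unsqueeze(1)  # (B, S, T, dim)
        return separated.flatten(0, 1)                      # (B*S, T, dim)


if __name__ == "__main__":
    dim, split_at = 64, 2  # split point chosen arbitrarily for the demo
    # Stand-in for a pretrained wav2vec 2.0-style encoder layer stack.
    layers = nn.ModuleList(
        nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        for _ in range(4)
    )
    for p in layers.parameters():
        p.requires_grad = False      # freeze the well-trained ASR encoder
    sidecar = Sidecar(dim)           # only these parameters are trained

    x = torch.randn(3, 50, dim)      # mixed 2-speaker speech embedding
    for layer in layers[:split_at]:  # lower (more acoustic) layers
        x = layer(x)
    streams = sidecar(x)             # (batch*2, time, dim), one per speaker
    for layer in layers[split_at:]:  # upper (more linguistic) layers
        streams = layer(streams)
    print(streams.shape)             # torch.Size([6, 50, 64])
```

Stacking the separated streams into the batch dimension is one simple way to let the unchanged upper layers and decoder transcribe each speaker independently, which matches the abstract's claim that only the small Sidecar needs training.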