This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lectures and conversational speech. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) that spans multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving a 5-15% relative error reduction over utterance-based baselines on lecture and conversational ASR benchmarks. Although these results show a remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve recognition accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% character error rate on the HKUST dataset and 12.0%/6.3% word error rates on the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and further enables streaming ASR with limited accuracy degradation.
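To make the context-expanded input concrete, the following is a minimal sketch, assuming PyTorch: the current utterance's features are concatenated with those of a few preceding utterances along the time axis, while only the last utterance's token sequence serves as the prediction target. The function and field names (make_context_input, feat, tokens) are hypothetical illustrations, not the authors' released API.

```python
# Minimal sketch of context-expanded input construction (hypothetical names).
import torch

def make_context_input(utterances, max_context=3):
    """Concatenate up to `max_context` consecutive utterances along the
    time axis; only the last utterance's labels are the prediction target."""
    context = utterances[-max_context:]                      # most recent utterances
    feats = torch.cat([u["feat"] for u in context], dim=0)   # (T_total, D)
    target = context[-1]["tokens"]                           # labels for last utterance only
    return feats, target

# Usage: each utterance holds a (T, D) feature tensor and its token ids.
utts = [
    {"feat": torch.randn(120, 80), "tokens": torch.tensor([5, 9, 2])},
    {"feat": torch.randn(90, 80),  "tokens": torch.tensor([7, 3])},
    {"feat": torch.randn(150, 80), "tokens": torch.tensor([4, 8, 6, 1])},
]
feats, target = make_context_input(utts)
print(feats.shape, target)  # torch.Size([360, 80]) tensor([4, 8, 6, 1])
```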