This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embeddings. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages the model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial for downstream spoken language understanding tasks.
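The iterative refinement loop mentioned above can be sketched in miniature. This is not the paper's implementation, only a minimal illustration of the generic mask-predict idea: start from a fully masked sequence, commit the model's predictions, then re-mask the lowest-confidence positions (with a linearly decaying mask budget) for the next pass. The names `mask_predict`, `predict_fn`, and the toy predictor are all hypothetical; the actual BERT-CTC inference interleaves this loop with CTC decoding, which is omitted here.

```python
MASK = "<mask>"

def mask_predict(predict_fn, length, iterations=4):
    """Generic mask-predict refinement sketch (not the paper's exact algorithm).

    predict_fn(tokens) -> list of (token, confidence), one pair per position.
    """
    tokens = [MASK] * length
    for t in range(iterations):
        preds = predict_fn(tokens)
        tokens = [tok for tok, _ in preds]  # commit all current predictions
        # Linearly decay how many positions get re-masked each iteration.
        n_mask = int(length * (1 - (t + 1) / iterations))
        if n_mask == 0:
            break
        # Re-mask the n_mask lowest-confidence positions for the next pass.
        order = sorted(range(length), key=lambda i: preds[i][1])
        for i in order[:n_mask]:
            tokens[i] = MASK
    return tokens

# Toy stand-in for the token predictor: always proposes the target token,
# with higher confidence for positions that were already unmasked.
target = list("speech")

def toy_predictor(tokens):
    return [(target[i], 0.9 if tokens[i] != MASK else 0.5)
            for i in range(len(tokens))]

result = mask_predict(toy_predictor, len(target), iterations=3)
```

With the toy predictor, `result` converges to the target sequence once the mask budget reaches zero; in the real system the predictor would be the BERT-conditioned network and confidences would come from its output distribution.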