Automatic speech recognition (ASR) systems have improved greatly over the past few decades, and current systems fall mainly into two categories: hybrid and end-to-end. The recently proposed CTC-CRF framework inherits the data efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance CTC-CRF based ASR by exploring modeling units and neural architectures. Specifically, we investigate techniques that enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) Conformer can improve recognition performance significantly; and (ii) wordpiece-based systems perform slightly worse than phone-based systems when the target language has a low degree of grapheme-phoneme correspondence (e.g. English), while the two systems can perform equally well when that degree of correspondence is high (e.g. German).