The recently proposed conformer architecture has been used successfully in end-to-end automatic speech recognition (ASR) systems, achieving state-of-the-art performance on several datasets. To the best of our knowledge, the impact of using a conformer acoustic model for hybrid ASR has not yet been investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe. We study different training aspects and methods to improve word error rate (WER) and to increase training speed. For efficient training, we apply time downsampling methods and then use transposed convolutions to upsample the output sequence back to the original frame rate. We conduct experiments on the Switchboard 300h dataset, where our conformer-based hybrid model achieves competitive results compared to other architectures. It generalizes very well to the Hub5'01 test set and significantly outperforms the BLSTM-based hybrid model.
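The downsample-then-upsample idea mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the recipe from the paper: the downsampling factor of 3, the fixed averaging kernel, and the function names are all assumptions chosen for illustration, and in a real model both kernels would be learned. The point is only to show how a strided convolution shortens the sequence and how a transposed convolution restores one output per input frame, as a hybrid model needs for frame-wise targets.

```python
def strided_conv1d(x, kernel, stride):
    """Valid 1D convolution (cross-correlation) applied every `stride` frames."""
    k = len(kernel)
    return [
        sum(x[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(x) - k + 1, stride)
    ]


def transposed_conv1d(y, kernel, stride):
    """Transposed 1D convolution: each input frame scatters a scaled kernel copy."""
    k = len(kernel)
    # Output length follows the usual formula (T - 1) * stride + kernel_size.
    out = [0.0] * ((len(y) - 1) * stride + k)
    for t, value in enumerate(y):
        for i in range(k):
            out[t * stride + i] += value * kernel[i]
    return out


def down_then_up(frames, factor=3):
    """Downsample in time by `factor`, then upsample back to the input length.

    `factor` and both kernels are hypothetical; they are fixed here only so
    the example is deterministic.
    """
    num_frames = len(frames)
    # Zero-pad so the length is a multiple of the downsampling factor.
    pad = (-num_frames) % factor
    padded = list(frames) + [0.0] * pad

    avg_kernel = [1.0 / factor] * factor  # stands in for a learned kernel
    short = strided_conv1d(padded, avg_kernel, stride=factor)

    ones_kernel = [1.0] * factor          # stands in for a learned kernel
    long = transposed_conv1d(short, ones_kernel, stride=factor)

    # Crop the padding so there is exactly one output per original frame.
    return short, long[:num_frames]


if __name__ == "__main__":
    frames = [float(i) for i in range(10)]
    short, restored = down_then_up(frames, factor=3)
    print(len(frames), len(short), len(restored))
```

With kernel size equal to the stride, the encoder in between would operate on roughly one third of the frames, which is where the training speed-up comes from, while the final crop keeps the output aligned with the original frame-level labels.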