To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of sequence-to-sequence modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement, and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.