Accurate sequence-to-sequence (seq2seq) alignment is critical for applications such as medical speech analysis and language learning tools that rely on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as Connectionist Temporal Classification (CTC)- and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric over the sequence space, called the Sequence Optimal Transport Distance (SOTD), and discuss its theoretical properties. Based on the SOTD, we propose the Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with that of CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though at the cost of some ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community. Our code is publicly available at: https://github.com/idiap/OTTC