The Automated Speech Recognition (ASR) community is experiencing a major turning point with the rise of fully neural (End-to-End, E2E) approaches. At the same time, the conventional hybrid model remains the standard choice for practical ASR deployments. According to previous studies, the adoption of E2E ASR in real-world applications has been hindered by two main limitations: the limited ability of E2E models to generalize to unseen domains and their high operational cost. In this paper, we investigate both drawbacks by performing a comprehensive multi-domain benchmark of several contemporary E2E models and a hybrid baseline. Our experiments demonstrate that E2E models are viable alternatives to the hybrid approach, and even outperform the baseline in both accuracy and operational efficiency. As a result, our study shows that generalization and complexity issues are no longer the major obstacles to industrial integration, and draws the community's attention to other potential limitations of E2E approaches in some specific use cases.