Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which incorporates future contextual information via a right-to-left attention decoder to improve the representational ability of the shared encoder and the performance of the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10\% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.
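To make the U2++ rescoring idea concrete, the following minimal Python sketch illustrates how first-pass CTC hypotheses could be rescored by combining left-to-right and right-to-left attention-decoder scores. All names, signatures, and weight values here are illustrative assumptions for exposition, not the toolkit's actual API.

# Illustrative sketch of U2++-style second-pass rescoring (hypothetical names,
# not WeNet's real implementation).
from typing import Callable, List, Tuple

def rescore_hypotheses(
    hypotheses: List[Tuple[List[int], float]],        # (token ids, CTC score) from the first pass
    l2r_decoder_score: Callable[[List[int]], float],  # left-to-right attention decoder log-prob
    r2l_decoder_score: Callable[[List[int]], float],  # right-to-left attention decoder log-prob
    ctc_weight: float = 0.5,
    reverse_weight: float = 0.3,
) -> List[int]:
    """Pick the best first-pass hypothesis by combining the CTC score with
    scores from both attention decoders, as in the U2++ rescoring stage."""
    best_tokens, best_score = [], float("-inf")
    for tokens, ctc_score in hypotheses:
        # The forward decoder scores the hypothesis as-is; the backward decoder
        # scores the reversed token sequence, injecting future (right) context.
        attn_score = (1.0 - reverse_weight) * l2r_decoder_score(tokens) \
                     + reverse_weight * r2l_decoder_score(list(reversed(tokens)))
        score = ctc_weight * ctc_score + (1.0 - ctc_weight) * attn_score
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens

In this sketch the backward decoder contributes a weighted share (reverse_weight) of the attention score, which is how the right-to-left pass supplies future contextual information during rescoring.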