In this paper, we present WeNet, a new open-source, production-first and production-ready end-to-end (E2E) speech recognition toolkit. The main motivation of WeNet is to close the gap between research and production of E2E speech recognition models. WeNet provides an efficient way to ship ASR applications in several real-world scenarios, which is the main difference from, and advantage over, other open-source E2E speech recognition toolkits. This paper introduces WeNet from three aspects: model architecture, framework design, and performance metrics. Our experiments on AISHELL-1 with WeNet not only give a promising character error rate (CER) on a unified streaming and non-streaming two-pass (U2) E2E model, but also show reasonable real-time factor (RTF) and latency; both of these aspects are favorable for production adoption. The toolkit is publicly available at https://github.com/mobvoi/wenet.