Automatic Speech Recognition (ASR) systems typically produce output in lexical form, whereas users prefer output in written form. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). Previous work has used Weighted Finite State Transducers (WFSTs) for ITN. WFSTs are well suited to this task, but their size and run-time costs can make deployment in embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight, and accurate. At the core of our system is a streaming transformer tagger that tags lexical tokens from the ASR output. Each tag indicates which ITN category, if any, should be applied. We then apply an ITN-category-specific WFST, only to the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs on par with strong baselines while being significantly smaller and retaining customization capabilities.
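The tag-then-convert flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_tagger` is a hypothetical lookup-based stand-in for the streaming transformer tagger, `convert_cardinal` is a hypothetical stand-in for a category-specific WFST, the `CARDINAL` and `O` tag names are assumptions, and the sketch processes a whole utterance at once rather than streaming.

```python
from typing import Callable, Dict, List, Tuple

# Toy vocabulary of number words (assumption: only 0-19 and the tens are covered).
_UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
          "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
_TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()
NUMBER_WORDS: Dict[str, int] = {w: i for i, w in enumerate(_UNITS)}
NUMBER_WORDS.update({w: 10 * (i + 2) for i, w in enumerate(_TENS)})


def toy_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for the transformer tagger.

    Emits one tag per token: "CARDINAL" for number words, "O" for "no ITN".
    """
    return ["CARDINAL" if t in NUMBER_WORDS else "O" for t in tokens]


def convert_cardinal(span: List[str]) -> str:
    """Hypothetical stand-in for a category-specific WFST.

    Naively sums the word values, which is only right for simple
    tens-plus-units phrases such as "twenty three".
    """
    return str(sum(NUMBER_WORDS[t] for t in span))


# One converter per ITN category; only tagged spans ever reach a converter.
CONVERTERS: Dict[str, Callable[[List[str]], str]] = {"CARDINAL": convert_cardinal}


def group_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive tokens that share a tag into (tag, tokens) spans."""
    spans: List[Tuple[str, List[str]]] = []
    for token, tag in zip(tokens, tags):
        if spans and spans[-1][0] == tag:
            spans[-1][1].append(token)
        else:
            spans.append((tag, [token]))
    return spans


def inverse_normalize(lexical: str) -> str:
    """Tag the lexical tokens, then convert only the tagged spans."""
    tokens = lexical.split()
    tags = toy_tagger(tokens)
    out: List[str] = []
    for tag, span in group_spans(tokens, tags):
        if tag == "O":
            out.extend(span)  # untagged text passes through unchanged
        else:
            out.append(CONVERTERS[tag](span))
    return " ".join(out)


print(inverse_normalize("i have twenty three apples"))  # i have 23 apples
```

Because untagged tokens bypass the converters entirely, each converter only needs to cover its own category, which is one plausible reading of how the described design keeps the WFST component small.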