Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition (ASR). It converts numbers, dates, abbreviations, and other semiotic classes from the spoken form generated by ASR to their written forms. One can treat ITN as a machine translation task and solve it with neural sequence-to-sequence models. Unfortunately, such neural models are prone to hallucinations that can lead to unacceptable errors. To mitigate this issue, we propose a single-pass token classifier model that treats ITN as a tagging task. The model assigns a replacement fragment to every input token or marks it for deletion or copying without changes. We present a dataset preparation method based on the granular alignment of ITN examples. The proposed model is less prone to hallucination errors. The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets. The one-to-one correspondence between tags and input words improves the interpretability of the model's predictions, simplifies debugging, and allows for post-processing corrections. The model is simpler than sequence-to-sequence models and easier to optimize in production settings. The model and the code to prepare the dataset are published as part of the NeMo project.
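To make the tagging formulation concrete, the following is a minimal sketch of how such per-token tags could be applied at inference time. The tag vocabulary (`<SELF>`, `<DELETE>`), the function name, and the example prediction are illustrative assumptions, not the released NeMo implementation.

```python
# A minimal sketch of applying a per-token ITN tagging scheme at inference
# time. Tag names (<SELF>, <DELETE>) and the example tags are assumptions
# made for illustration; they are not the released NeMo code.

SELF = "<SELF>"      # copy the input token unchanged
DELETE = "<DELETE>"  # drop the input token entirely

def apply_itn_tags(tokens, tags):
    """Build the written form from one predicted tag per spoken token."""
    assert len(tokens) == len(tags), "tagging is one-to-one with input tokens"
    out = []
    for token, tag in zip(tokens, tags):
        if tag == SELF:
            out.append(token)   # keep the spoken token as-is
        elif tag == DELETE:
            continue            # token is consumed by a neighboring fragment
        else:
            out.append(tag)     # tag is a written-form replacement fragment
    # Joining with spaces is a simplification; a real system would also merge
    # adjacent fragments (e.g. the digits of one number) without spaces.
    return " ".join(out)

spoken = ["it", "costs", "one", "hundred", "twenty", "dollars"]
tags   = [SELF, SELF, "$120", DELETE, DELETE, DELETE]  # hypothetical prediction
print(apply_itn_tags(spoken, tags))  # -> "it costs $120"
```

Because every output decision is tied to a specific input token, a wrong prediction can be traced back to its source word and corrected in post-processing, which is the interpretability advantage claimed over sequence-to-sequence decoding.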