Features such as punctuation, capitalization, and formatting of entities are important for readability, understanding, and natural language processing tasks. However, Automatic Speech Recognition (ASR) systems produce spoken-form text devoid of formatting, and existing tagging approaches to formatting address only one or two of these features at a time. In this paper, we unify spoken-to-written text conversion via a two-stage process: First, we use a single transformer tagging model to jointly produce token-level tags for inverse text normalization (ITN), punctuation, capitalization, and disfluencies. Then, we apply the tags to generate written-form text and use weighted finite state transducer (WFST) grammars to format tagged ITN entity spans. Despite joining four models into one, our unified tagging approach matches or outperforms task-specific models across all four tasks on benchmark test sets across several domains.
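The second stage described above can be sketched as follows. This is a minimal, hypothetical illustration (the tag names, the tag-application logic, and the toy ITN formatter standing in for the WFST grammars are all assumptions, not the paper's actual implementation): per-token tags from the first stage drive disfluency removal, capitalization, punctuation restoration, and hand-off of ITN entity spans to a separate formatter.

```python
# Hypothetical sketch of applying token-level tags to produce written-form
# text. Tag names (DISFLUENCY, ITN, CAP, PERIOD, COMMA) and the toy ITN
# formatter are illustrative stand-ins, not the paper's actual tag set.

def apply_tags(tokens, tags, format_itn):
    """Render spoken-form tokens as written-form text using token-level tags."""
    out = []
    itn_span = []
    for tok, tag in zip(tokens, tags):
        if "DISFLUENCY" in tag:        # drop disfluent tokens ("uh", repeats)
            continue
        if "ITN" in tag:               # accumulate an ITN entity span
            itn_span.append(tok)
            continue
        if itn_span:                   # span just ended: format it
            out.append(format_itn(itn_span))
            itn_span = []
        word = tok.capitalize() if "CAP" in tag else tok
        if "PERIOD" in tag:
            word += "."
        elif "COMMA" in tag:
            word += ","
        out.append(word)
    if itn_span:                       # handle a span at end of utterance
        out.append(format_itn(itn_span))
    return " ".join(out)

# Toy number formatter standing in for the WFST grammars.
NUMS = {"twenty": 20, "five": 5}
def toy_itn(span):
    return str(sum(NUMS.get(w, 0) for w in span))

tokens = ["uh", "i", "paid", "twenty", "five", "dollars"]
tags   = ["DISFLUENCY", "CAP", "O", "ITN", "ITN", "O+PERIOD"]
print(apply_tags(tokens, tags, toy_itn))  # → I paid 25 dollars.
```

In the paper's pipeline, the ITN spans would be rendered by WFST grammars rather than a lookup table, which keeps entity formatting deterministic and auditable while the single tagger handles all four tasks jointly.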