Contextual knowledge is essential for reducing speech recognition errors on high-value long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained from external contextual information. With only a small overhead in memory use and computation cost, TCPGen efficiently structures thousands of biasing words into a symbolic prefix tree and creates a neural shortcut between the tree and the final ASR output to facilitate the recognition of the biasing words. To enhance TCPGen, we further propose a novel minimum biasing word error (MBWE) loss that directly optimises biasing word errors during training, along with a biasing-word-driven language model discounting (BLMD) method applied at test time. All contextual ASR systems were evaluated on the public Librispeech audiobook corpus and on data from the dialogue state tracking challenges (DSTC), with biasing lists extracted from the dialogue-system ontology. Consistent word error rate (WER) reductions were achieved with TCPGen, which were particularly significant on the biasing words, with around 40\% relative reduction in recognition error rates. MBWE and BLMD further improved the effectiveness of TCPGen and achieved even larger WER reductions on the biasing words. TCPGen also achieved zero-shot learning of words not in the audio training set, with large WER reductions on the out-of-vocabulary words in the biasing list.
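To illustrate the symbolic side of the approach, the following is a minimal sketch of how a biasing list can be structured into a prefix tree (trie). The neural components of TCPGen are omitted; the helper names (`build_prefix_tree`, `valid_next_tokens`) are illustrative rather than from the paper. It shows only the symbolic structure and the set of tokens the tree permits after a given decoded prefix, which is the constraint a tree-constrained pointer distribution would respect.

```python
class TrieNode:
    """A node in the symbolic prefix tree of biasing words."""
    def __init__(self):
        self.children = {}      # token -> TrieNode
        self.is_word_end = False

def build_prefix_tree(biasing_words):
    """Insert each biasing word, as a token sequence, into a trie."""
    root = TrieNode()
    for word in biasing_words:
        node = root
        for token in word:      # e.g. characters or wordpieces
            node = node.children.setdefault(token, TrieNode())
        node.is_word_end = True
    return root

def valid_next_tokens(root, prefix):
    """Return the tokens the tree allows after `prefix`.

    An empty set means the prefix has fallen off the tree, so no
    biasing word can be completed from here.
    """
    node = root
    for token in prefix:
        if token not in node.children:
            return set()
        node = node.children[token]
    return set(node.children)

# Example: three biasing words stored character by character.
root = build_prefix_tree(["cat", "car", "dog"])
valid_next_tokens(root, "ca")   # {'t', 'r'}
```

Lookup cost is linear in the prefix length and independent of the number of biasing words, which is why thousands of words can be handled with little overhead.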