Two Stage Contextual Word Filtering for Context bias in Unified Streaming and Non-streaming Transducer

It is difficult for an end-to-end (E2E) ASR system to recognize words, such as named entities, that appear infrequently in the training data. A widely used method to mitigate this issue is to feed contextual information into the acoustic model. This requires a contextual word list that enumerates all possible contextual word candidates. Previous work has shown that the size and quality of the list are crucial: a compact and accurate list can boost performance significantly. In this paper, we propose an efficient approach to obtain a high-quality contextual word list for a unified streaming and non-streaming Conformer-Transducer (C-T) model. Specifically, we first use the phone-level streaming output to filter the predefined contextual word list. During the subsequent non-streaming inference, the words in the filtered list are treated as contextual information and fused into the non-causal encoder and decoder to generate the final recognition results. Our approach takes advantage of the streaming recognition hypothesis, improves the accuracy of the contextual ASR system, and speeds up the inference process. Experiments on two datasets demonstrate over 20% relative character error rate reduction (CERR) compared to the baseline system. Meanwhile, the real-time factor (RTF) of our system stays within 0.15 when the size of the contextual word list grows beyond 6,000.
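The first-stage idea (pruning the contextual word list with the phone-level streaming hypothesis) can be illustrated with a minimal Python sketch. The abstract does not state the matching criterion, so this sketch swaps in a simple stand-in: a word is kept if its phone sequence approximately matches some contiguous span of the streaming phone output, measured by edit distance. The function names, the `lexicon` mapping, the `max_error_rate` threshold, and the toy phone symbols are all assumptions made for illustration, not the paper's method. The second stage (fusing the filtered words into the non-causal encoder and decoder) is not modeled here.

```python
from typing import Dict, List, Sequence


def min_substring_edit_distance(pattern: Sequence[str], text: Sequence[str]) -> int:
    """Smallest edit distance between `pattern` and any contiguous span of `text`."""
    # Row 0 is all zeros, so a match may start at any position in `text`.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        # Column 0: the first i pattern phones aligned to an empty span.
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j] + 1,             # pattern phone missing from the text
                curr[j - 1] + 1,         # extra phone in the text
                prev[j - 1] + (p != t),  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    # Taking the minimum over the last row lets the match end anywhere.
    return min(prev)


def filter_context_words(
    streaming_phones: Sequence[str],
    lexicon: Dict[str, List[str]],
    max_error_rate: float = 0.34,  # hypothetical tolerance for streaming errors
) -> List[str]:
    """Keep words whose phones roughly occur in the streaming phone hypothesis."""
    kept = []
    for word, phones in lexicon.items():
        if not phones:
            continue  # skip entries without a pronunciation
        dist = min_substring_edit_distance(phones, streaming_phones)
        if dist / len(phones) <= max_error_rate:
            kept.append(word)
    return kept


# Toy example: phone-level streaming output and a small contextual word list.
streaming_phones = ["K", "AO", "L", "JH", "AA", "N", "S", "M", "IH", "TH"]
lexicon = {
    "John": ["JH", "AA", "N"],
    "Smyth": ["S", "M", "IH", "TH"],
    "Joan": ["JH", "OW", "N"],         # one substitution away from the hypothesis
    "Alice": ["AE", "L", "AH", "S"],   # no close match in the hypothesis
}
filtered = filter_context_words(streaming_phones, lexicon)
print(filtered)
```

In this toy run, "Joan" survives because the threshold tolerates one wrong phone out of three, which mimics keeping candidates despite streaming recognition errors, while "Alice" is pruned. A looser threshold keeps more candidates for the non-streaming pass at the cost of a longer list.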
