This paper presents an extension for training end-to-end Context-Aware Transformer Transducer (CATT) models using a simple yet efficient method of mining hard negative phrases from the latent space of the context encoder. During training, given a reference query, we mine a number of similar phrases using approximate nearest neighbour search. These sampled phrases are then used as negative examples in the context list alongside random and ground-truth contextual information. By including approximate nearest neighbour phrases (ANN-P) in the context list, we encourage the learned representation to disambiguate between similar, but not identical, biasing phrases. This improves biasing accuracy when there are several similar phrases in the biasing inventory. We carry out experiments in a large-scale data regime, obtaining up to 7% relative word error rate reductions on the contextual portion of the test data. We also extend and evaluate the CATT approach in streaming applications.
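To illustrate the core idea, the following is a minimal sketch of ANN-based hard-negative phrase mining. It assumes a pretrained context encoder exposed as an `encode` function mapping a phrase to a fixed-size vector, and uses FAISS as one possible nearest-neighbour library; the function names, index choice, and phrase inventory are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: mine hard negative phrases for a reference query by nearest-
# neighbour search over context-encoder embeddings. FAISS index choice
# and the `encode` callable are assumptions for this example.
import numpy as np
import faiss

def mine_hard_negatives(encode, inventory, reference, k=5):
    """Return the k phrases in `inventory` whose embeddings lie closest
    to the reference query's embedding (excluding the reference itself)."""
    # Embed and L2-normalise so inner product equals cosine similarity.
    vecs = np.stack([encode(p) for p in inventory]).astype("float32")
    faiss.normalize_L2(vecs)
    # Exact inner-product index for clarity; an approximate index such as
    # faiss.IndexHNSWFlat would be used at scale, as the paper's ANN
    # search implies.
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    q = encode(reference).astype("float32")[None, :]
    faiss.normalize_L2(q)
    # Retrieve k+1 in case the reference itself is in the inventory.
    _, idx = index.search(q, k + 1)
    return [inventory[i] for i in idx[0] if inventory[i] != reference][:k]
```

During training, the mined ANN-P phrases would then be placed in the biasing context list alongside the ground-truth phrase and randomly sampled distractors, so the model must discriminate among near-identical candidates.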