This work studies the use of attention masking in transformer-transducer-based speech recognition to build a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first-pass streaming recognition and second-pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy-vs-latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves accuracy by up to 8% relative in the acoustic rescoring scenario.
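The following is a minimal sketch, not the paper's implementation, illustrating the three masking strategies on a toy sequence with NumPy. The parameter names (`left`, `right`, `chunk`, `left_chunks`, `chunk_choices`) and the uniform sampling distribution are illustrative assumptions.

```python
import numpy as np

def fixed_mask(T: int, left: int, right: int) -> np.ndarray:
    """Fixed masking: the same mask shape at every frame.
    Frame t attends to frames in [t - left, t + right]."""
    idx = np.arange(T)
    return (idx[None, :] >= idx[:, None] - left) & (idx[None, :] <= idx[:, None] + right)

def chunked_mask(T: int, chunk: int, left_chunks: int = 1) -> np.ndarray:
    """Chunked masking: the mask for each frame is determined by chunk
    boundaries. Frame t attends to every frame in its own chunk (including
    look-ahead within the chunk) and in `left_chunks` preceding chunks."""
    c = np.arange(T) // chunk  # chunk index of each frame
    return (c[None, :] <= c[:, None]) & (c[None, :] >= c[:, None] - left_chunks)

def sample_variable_mask(T: int, chunk_choices=(8, 16, 32), rng=None) -> np.ndarray:
    """Variable masking: at training time, sample the mask configuration
    from a target distribution (here, uniform over assumed chunk sizes)."""
    rng = rng or np.random.default_rng()
    return chunked_mask(T, chunk=int(rng.choice(chunk_choices)))

if __name__ == "__main__":
    print(fixed_mask(6, left=2, right=1).astype(int))
    print(chunked_mask(6, chunk=2).astype(int))
```

In this sketch, a boolean `True` at position (t, s) means frame t may attend to frame s; a full-context mask for second-pass acoustic rescoring would simply be all-`True`.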