Accurate recognition of slot values, such as domain-specific words or named entities, by automatic speech recognition (ASR) systems is central to goal-oriented dialogue systems. Although this step directly affects downstream tasks such as language understanding, many domain-agnostic ASR systems perform poorly on domain-specific or long-tail words. Such systems are often supplemented with slot error correction models, but it is hard for any neural model to directly output such rare entity words. To address this problem, we propose k-nearest neighbor (k-NN) search that outputs domain-specific entities from an explicit datastore. We improve the error correction rate by augmenting a pretrained joint phoneme- and text-based transformer sequence-to-sequence model with k-NN search during inference. We evaluate the proposed approach on five domains containing long-tail slot entities such as full names, airports, street names, cities, and states. Our best-performing error correction model achieves a relative word error rate (WER) improvement of 7.4% on rare word entities over the baseline and a relative WER improvement of 9.8% on an out-of-vocabulary (OOV) test set.
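To make the idea of k-NN search over an explicit datastore concrete, the following is a minimal sketch of one common way to combine retrieval with a decoder at inference time: the model's next-token distribution is interpolated with a distribution built from the nearest datastore entries. The abstract does not specify the exact scheme, so the interpolation rule, the distance metric, and every name here (`knn_distribution`, `interpolate`, `lam`, `temperature`, the toy vectors and tokens) are assumptions for illustration only.

```python
import math
from collections import defaultdict


def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Build a token distribution from the k nearest datastore entries.

    `datastore` is a list of (key_vector, token) pairs. This layout is a
    hypothetical choice for the sketch, not taken from the paper.
    """
    # Squared Euclidean distance from the query to every stored key.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    ]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    if not nearest:
        return {}
    # Softmax over negative distances, shifted by the smallest distance
    # for numerical stability.
    d_min = nearest[0][0]
    weights = [(math.exp(-(d - d_min) / temperature), tok) for d, tok in nearest]
    total = sum(w for w, _ in weights)
    probs = defaultdict(float)
    for w, tok in weights:
        probs[tok] += w / total  # neighbors sharing a token pool their mass
    return dict(probs)


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the model and retrieval distributions with weight `lam` on k-NN."""
    tokens = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_model.get(t, 0.0)
        for t in tokens
    }


# Toy example: the model alone prefers a frequent word, while the datastore
# holds a rare entity close to the current (made-up) decoder state.
datastore = [
    ((1.0, 0.0), "heathrow"),
    ((0.9, 0.1), "heathrow"),
    ((0.0, 1.0), "hearth"),
]
query = (1.0, 0.05)
p_model = {"hearth": 0.7, "heathrow": 0.3}

p_knn = knn_distribution(query, datastore, k=2)
combined = interpolate(p_model, p_knn, lam=0.5)
best = max(combined, key=combined.get)
print(best, combined)
```

In this toy setup both retrieved neighbors carry the rare entity, so the interpolated distribution favors it even though the model alone does not. A real system would replace the linear scan with an approximate nearest neighbor index and would tune `k`, `lam`, and `temperature` on held-out data.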