Speech command recognition (SCR) is commonly used on resource-constrained devices to provide a hands-free user experience. In real applications, however, commands with similar pronunciations are often confused because the small models deployed on edge devices have limited capacity, and this confusion severely degrades the user experience. Inspired by advances in discriminative training for speech recognition, we propose a novel minimize sequential confusion error (MSCE) training criterion tailored to SCR to alleviate the command confusion problem. Building on the minimum classification error (MCE) discriminative criterion, MSCE aims to improve the model's ability to discriminate the target command from other commands, with the likelihood of each command defined through connectionist temporal classification (CTC). During training, rather than using the whole set of non-target commands, we exploit prior knowledge to construct a confusing sequence set for similar-sounding commands, which saves training resources and effectively reduces command confusion errors. We design and compare three different strategies for constructing this confusing set. On our collected speech command dataset, the proposed method achieves a relative reduction of 33.7% in False Reject Rate (FRR) at a False Alarm Rate (FAR) of 0.01 and a relative reduction of 18.28% in confusion errors.
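The abstract does not spell out the loss, so the following is only a minimal sketch of how an MCE-style criterion can be built on CTC command likelihoods, not the authors' MSCE implementation. It assumes the classic MCE misclassification measure (target score versus a soft maximum over competitor scores, passed through a sigmoid), a CTC blank at index 0, frame-level log-posteriors as plain Python lists, and a confusing set supplied by hand. The names `ctc_log_likelihood`, `msce_style_loss`, `BLANK`, `eta`, and `gamma` are hypothetical illustrations.

```python
import math
from typing import List, Sequence

BLANK = 0  # assumption: the CTC blank symbol sits at index 0


def _logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_log_likelihood(log_probs: List[List[float]], labels: Sequence[int]) -> float:
    """log P(labels | x) via the standard CTC forward (alpha) recursion.

    log_probs[t][k] is the frame-level log-posterior of token k at frame t.
    """
    # Extended label sequence with blanks: [_, l1, _, l2, ..., _]
    ext = [BLANK]
    for tok in labels:
        ext += [tok, BLANK]
    S = len(ext)
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][BLANK]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [-math.inf] * S
        for s in range(S):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between distinct non-blank labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = _logsumexp(cands) + log_probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank
    return _logsumexp(alpha[-2:]) if S > 1 else alpha[0]


def _sigmoid(x: float) -> float:
    """Overflow-safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def msce_style_loss(log_probs: List[List[float]], target: Sequence[int],
                    confusing_set: List[Sequence[int]],
                    eta: float = 1.0, gamma: float = 1.0) -> float:
    """Hypothetical MCE-style loss over a small confusing set (illustrative only)."""
    g_target = ctc_log_likelihood(log_probs, target)
    g_conf = [eta * ctc_log_likelihood(log_probs, c) for c in confusing_set]
    # Soft maximum over competitors: (1/eta) * log(mean_j exp(eta * g_j))
    g_comp = (_logsumexp(g_conf) - math.log(len(g_conf))) / eta
    d = g_comp - g_target  # d > 0 means competitors outscore the target
    return _sigmoid(gamma * d)


# Toy usage: vocabulary {0: blank, 1: "a", 2: "b"}, two identical frames
frame = [math.log(0.2), math.log(0.7), math.log(0.1)]
# Loss is below 0.5 because the target "a" outscores the confusable "b"
print(msce_style_loss([frame, frame], target=[1], confusing_set=[[2]]))
```

Restricting the competitor sum to a few similar-sounding commands, rather than every non-target command, is what keeps the cost of each loss evaluation small; in this sketch the cost grows linearly with the size of `confusing_set`.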