Designing a speech-to-intent (S2I) agent that maps a user's spoken commands to the agent's task actions is challenging due to the diverse grammatical and lexical preferences of different users. As a remedy, we discuss a user-taught S2I system in this paper. The user-taught system learns from scratch from the user's spoken input paired with action demonstrations, which ensures it fully matches the user's way of formulating intents and their articulation habits. The main issue is the scarcity of training data caused by the user effort involved. Existing state-of-the-art approaches in this setting are based on non-negative matrix factorization (NMF) and capsule networks. In this paper we combine the encoder of an end-to-end ASR system with the prior NMF/capsule-network-based user-taught decoder, and investigate whether this pre-training methodology can reduce the training data requirements of the NMF and capsule network models. Experimental results show that the pre-trained ASR-NMF framework significantly outperforms the other models. We also discuss the limitations of pre-training for different types of command-and-control (C&C) applications.