Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTMs and convolutional networks in various sequence modeling tasks, owing to their stronger temporal modeling capacity. However, it is not clear whether this advantage still holds for short-range temporal modeling tasks such as wake word detection. Moreover, the vanilla Transformer is not directly applicable to the task because of its non-streaming nature and its quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including look-ahead to the next chunk, gradient stopping, different positional embedding methods, and same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolutional network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear time and space complexity with respect to the sequence length.
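The chunk-wise streaming idea, with optional look-ahead to the next chunk, can be illustrated by the attention mask it induces. The sketch below is a hypothetical illustration (function name and parameters are ours, not the authors' implementation): each query position attends only to its own fixed-size chunk, and, when look-ahead is enabled, to the immediately following chunk, so the number of attended key positions per query is bounded by a constant and the overall cost is linear in the sequence length.

```python
def chunk_attention_mask(seq_len, chunk_size, look_ahead=True):
    """Boolean self-attention mask for chunk-wise streaming attention.

    mask[i][j] is True iff query position i may attend to key position j.
    Each position attends within its own chunk; with look_ahead=True it may
    also attend to the next chunk (a sketch of the look-ahead variant).
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        c = i // chunk_size                       # chunk index of the query
        start = c * chunk_size                    # first key in own chunk
        # with look-ahead, extend the window by one extra chunk
        end = min((c + (2 if look_ahead else 1)) * chunk_size, seq_len)
        for j in range(start, end):
            mask[i][j] = True
    return mask
```

Because each row of the mask has at most `2 * chunk_size` True entries, applying it (e.g. as an additive `-inf` mask before the softmax) keeps attention cost linear in `seq_len` rather than quadratic.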