End-to-end models have gradually become the mainstream approach for voice trigger, aiming to achieve the highest possible prediction accuracy with a small footprint. In this paper, we propose an end-to-end voice trigger framework, namely WakeupNet, which is built on a Transformer encoder. The purpose of this framework is to exploit the context-capturing capability of the Transformer, as sequential information is vital for wakeup-word detection. However, the conventional Transformer encoder is too large for our task. To address this issue, we apply different model compression approaches to shrink the vanilla encoder into a compact one, called mobile-Transformer. To evaluate its performance, we conduct extensive experiments on the large publicly available dataset HiMia. The results indicate that the proposed mobile-Transformer significantly outperforms other frequently used models for voice trigger in both clean and noisy scenarios.