Learned speech representations can drastically improve performance on tasks with limited labeled data. However, due to their size and complexity, learned representations have limited utility in mobile settings where run-time performance can be a significant bottleneck. In this work, we propose a class of lightweight non-semantic speech embedding models, based on the recently proposed TRILL speech embedding, that run efficiently on mobile devices. We combine novel architectural modifications with existing speed-up techniques to create embedding models that are fast enough to run in real time on a mobile device and exhibit minimal performance degradation on a benchmark of non-semantic speech tasks. One such model (FRILL) is 32x faster than TRILL on a Pixel 1 smartphone and 40% of its size, with an average decrease in accuracy of only 2%. To our knowledge, FRILL is the highest-quality non-semantic embedding designed for use on mobile devices. Furthermore, we demonstrate that these representations are useful for mobile health tasks such as non-speech human sound detection and face-masked speech detection. Our models and code are publicly available.