In keyword spotting (KWS), replacing handcrafted speech features with learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS when the number of filterbank channels is severely reduced. Reducing the number of channels may cause some drop in KWS performance, but it also substantially reduces energy consumption, which is key when deploying typical always-on KWS systems on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to the noise characteristics and thereby provides greater robustness to noise, especially when dropout is integrated. Thus, switching from the commonly used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3× reduction in energy consumption.