Building an efficient architecture for neural speech processing is paramount to successful keyword spotting deployment. However, it is very challenging for lightweight models to achieve noise robustness with concise neural operations. In real-world applications, the user environment is typically noisy and may also contain reverberation. We propose a novel feature-interactive convolutional model with merely 100K parameters to tackle keyword spotting under noisy far-field conditions. An interactive unit is proposed in place of the attention module; it promotes the flow of information with more efficient computation. Moreover, curriculum-based multi-condition training is adopted to attain better noise robustness. Our model achieves 98.2% top-1 accuracy on Google Speech Commands V2-12 and is competitive against large transformer models under the designed noise condition.
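To make the idea of curriculum-based multi-condition training concrete, the following is a minimal sketch and not the authors' recipe. It assumes the curriculum is expressed as a signal-to-noise ratio (SNR) range that widens toward noisier conditions as training progresses. The function names, the 20 dB to 0 dB range, and the linear schedule are all hypothetical choices made for illustration. Reverberation (the far-field part) is not modeled here.

```python
import math
import random


def snr_range_for_epoch(epoch, total_epochs, easy_snr_db=20.0, hard_snr_db=0.0):
    """Return the (low, high) SNR range in dB to sample from at this epoch.

    Assumed schedule: the upper bound stays at easy_snr_db while the lower
    bound slides linearly down to hard_snr_db, so noisier conditions are
    introduced gradually.
    """
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    low = easy_snr_db + progress * (hard_snr_db - easy_snr_db)
    return low, easy_snr_db


def mix_at_snr(clean, noise, snr_db):
    """Scale noise so the clean-to-noise power ratio equals snr_db, then add.

    Both inputs are equal-length sequences of float samples.
    """
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if clean_power == 0.0 or noise_power == 0.0:
        return list(clean)  # nothing meaningful to scale against
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]


def make_training_example(clean, noise, epoch, total_epochs, rng):
    """Draw an SNR from the current curriculum range and build a noisy input."""
    low, high = snr_range_for_epoch(epoch, total_epochs)
    snr_db = rng.uniform(low, high)
    return mix_at_snr(clean, noise, snr_db), snr_db


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.1 * i) for i in range(1000)]   # stand-in for speech
    noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for noise
    for epoch in (0, 5, 9):
        _, snr_db = make_training_example(clean, noise, epoch, 10, rng)
        print(f"epoch {epoch}: sampled SNR {snr_db:.1f} dB")
```

Under this assumed schedule, early epochs see only mildly corrupted inputs, while later epochs sample from the full range of conditions, including the easy ones, so the training remains multi-condition.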