A keyword spotting model must have a small footprint, since it typically runs on-device with limited computational resources. However, maintaining state-of-the-art (SOTA) performance at a reduced model size is challenging. In addition, far-field and noisy environments with interference from multiple signals aggravate the problem and cause accuracy to degrade significantly. In this paper, we present a multi-channel ConvMixer for speech command recognition. The novel architecture introduces an additional audio channel mixing step that lets the channels interact in a multi-channel audio setting, yielding more noise-robust features with more efficient computation. In addition, we propose a centroid-based awareness component that enhances the system by equipping it with additional spatial geometry information in the latent feature projection space. We evaluate our model on the new MISP Challenge 2021 dataset. Our model achieves a significant improvement over the official baseline, with a 55% gain in the competition score (0.152) on raw microphone array input and a 63% gain (0.126) with front-end speech enhancement.
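The abstract does not spell out how the audio channel mixing is computed. As a rough illustration only, the sketch below assumes it behaves like the pointwise mixing used in ConvMixer-style models, applied across microphone channels: at every feature position, each output channel is a weighted sum of all input channels, added back to the input through a residual connection. The function name `mix_audio_channels`, the square mixing matrix `weights`, and the residual connection are assumptions made for this example, not details taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def mix_audio_channels(features: Matrix, weights: Matrix) -> Matrix:
    """Mix information across microphone channels at each feature position.

    features: one row per microphone channel; each row is a flattened
              feature map (for example, time-frequency bins).
    weights:  hypothetical square mixing matrix (channels x channels),
              which would be learned in a real model.
    Returns the input plus the mixed signal (assumed residual connection).
    """
    n_channels = len(features)
    n_positions = len(features[0])
    if any(len(row) != n_positions for row in features):
        raise ValueError("all channels must have the same number of positions")
    if len(weights) != n_channels or any(len(r) != n_channels for r in weights):
        raise ValueError("weights must be a channels x channels matrix")

    mixed: Matrix = []
    for out_ch in range(n_channels):
        row = []
        for pos in range(n_positions):
            # Weighted sum over all microphone channels at this position.
            cross = sum(
                weights[out_ch][in_ch] * features[in_ch][pos]
                for in_ch in range(n_channels)
            )
            # Residual connection keeps the original channel content.
            row.append(features[out_ch][pos] + cross)
        mixed.append(row)
    return mixed


# Two microphones, two feature positions, uniform averaging weights.
example = mix_audio_channels(
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
)
```

Because the mixing is pointwise, its cost grows with the square of the number of microphones rather than with the size of a spatial kernel, which is one plausible reading of the "more efficient computation" claim. In practice the same operation would be written as a 1x1 convolution over the microphone axis in a deep learning framework.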