This paper extends our previous wake word spotting system, which ranked 2nd in Track 1 of the MISP Challenge 2021. First, we investigate a robust unimodal approach based on 3D and 2D convolution and adopt the simple attention module (SimAM) to improve performance. Second, we explore different combinations of data augmentation methods. Finally, we study fusion strategies, including score-level, cascaded, and neural fusion. Our proposed multimodal system leverages multimodal features and uses complementary visual information to mitigate the performance degradation of audio-only systems in complex acoustic scenarios. On the evaluation set of the competition database, our system obtains a false reject rate of 2.15% and a false alarm rate of 3.44%, setting a new state of the art with a 21% relative improvement over previous systems.
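To make the score-level fusion and the two reported metrics concrete, the following is a minimal sketch, not the system's actual implementation. It assumes each modality emits one wake word score per utterance, that fusion is a weighted average, and that a score at or above a threshold counts as a detection. The function names, the `audio_weight` parameter, and the threshold are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def fuse_scores(
    audio: Sequence[float],
    video: Sequence[float],
    audio_weight: float = 0.5,  # hypothetical weight, not a value from the paper
) -> List[float]:
    """Score-level fusion as a weighted average of per-utterance scores."""
    if len(audio) != len(video):
        raise ValueError("audio and video must have the same number of scores")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return [audio_weight * a + (1.0 - audio_weight) * v for a, v in zip(audio, video)]


def frr_far(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false reject rate, false alarm rate).

    labels: 1 = wake word present, 0 = wake word absent.
    An utterance is accepted when its score is >= threshold (assumed rule).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    # False reject: wake word present but score falls below the threshold.
    false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    # False alarm: wake word absent but score reaches the threshold.
    false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return false_rejects / positives, false_alarms / negatives
```

In practice the fusion weight and decision threshold would be tuned on a development set. The cascaded and neural fusion strategies named in the abstract combine the modalities differently and are not covered by this sketch.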