The performance of audio-only keyword spotting (KWS) systems, commonly measured by false alarms and false rejects, degrades significantly under far-field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages the complementary relationships across multiple modalities, has recently gained much attention. However, current studies mainly focus on fusing the independently learned representations of different modalities, rather than exploring inter-modal relationships during the modeling of each modality. In this paper, we propose a novel visual-modality-enhanced end-to-end KWS framework (VE-KWS), which fuses the audio and visual modalities in two ways. The first is to use speaker location information obtained from the lip region in videos to assist the training of a multi-channel audio beamformer. With the beamformer serving as an audio enhancement front-end, the acoustic distortions caused by far-field or noisy environments can be significantly suppressed. The second is to apply cross-attention between the modalities to capture inter-modal relationships and aid the representation learning of each modality. Experiments on the MISP challenge corpus show that our proposed model achieves a 2.79% false rejection rate and a 2.95% false alarm rate on the Eval set, establishing new SOTA performance compared with the top-ranking systems in the ICASSP 2022 MISP Challenge.
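To make the cross-attention fusion concrete, the following PyTorch sketch shows one plausible way to let each modality attend to the other during representation learning. The class name `CrossModalAttention`, the embedding dimension, the head count, and the residual/LayerNorm arrangement are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-attention block: each modality queries the other.

    All names and dimensions here are assumptions for illustration; the
    actual VE-KWS architecture may differ.
    """
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Audio features attend to visual features, and vice versa.
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio:  (batch, T_audio, dim) frame-level acoustic embeddings
        # visual: (batch, T_video, dim) lip-region embeddings
        a_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        v_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        # Residual connections preserve each modality's own representation
        # while injecting cross-modal context from the other stream.
        return self.norm_a(audio + a_ctx), self.norm_v(visual + v_ctx)

# Usage sketch with dummy tensors (shapes are arbitrary examples):
block = CrossModalAttention()
audio = torch.randn(2, 100, 256)   # e.g. 100 acoustic frames
visual = torch.randn(2, 25, 256)   # e.g. 25 video frames
fused_audio, fused_visual = block(audio, visual)
```

Because the attention runs in both directions, each modality's representation is refined by the other's context at every layer, rather than the two streams being combined only once after independent encoding.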