This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The goal of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate more realistic far-field scenarios. For the video information, the provided region of interest (ROI) is used to obtain the visual representation. A multi-layer CNN is then proposed to learn audio and visual representations, which are fed into our two-branch attention-based fusion network, built on architectures such as the transformer and the conformer. The focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are integrated by voting to achieve our final score of 0.091.
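Since the abstract highlights fine-tuning with the focal loss, the following is a minimal sketch of a binary focal loss in PyTorch, assuming a wake-word/non-wake-word classification head with raw logits; the alpha and gamma defaults are illustrative choices, not the values used in this system.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss (Lin et al., 2017): down-weights easy examples
    so training focuses on hard positives/negatives.

    logits:  raw model outputs, shape (batch,)
    targets: 0/1 float labels, shape (batch,)
    alpha, gamma: illustrative defaults, not the paper's settings.
    """
    # Per-example binary cross-entropy, kept unreduced.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # bce = -log(p_t), so p_t (probability of the true class) is exp(-bce).
    p_t = torch.exp(-bce)
    # Class-balancing weight: alpha for positives, (1 - alpha) for negatives.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # The modulating factor (1 - p_t)^gamma shrinks the loss of
    # well-classified examples, emphasizing the hard ones.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```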