The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology to specific scenarios by promoting research into wake word detection, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve ``who spoke when'' using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that addresses ``who spoke what when'' using the audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other, with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, as well as the potential difficulties in this challenge due to, e.g., the far-field video quality, the background TV noise, and speakers who are hard to distinguish.