We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations in which the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering speakers into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial improvements of around 50%, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for MCoRec.