Audio-visual speech enhancement (AVSE) uses visual auxiliary information to extract a target speaker's speech from mixed audio. Real-world scenarios often involve complex acoustic environments with various interfering sounds and reverberation. Most previous methods struggle under such conditions, yielding extracted speech of poor perceptual quality. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in complex multimodal environments. We validated our system's performance in AVSEC-4: it achieved excellent results on the three objective metrics of the competition leaderboard and ultimately secured first place in the human subjective listening test.
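The staging described above can be sketched as a simple composition of two modules. This is a minimal illustrative sketch only: the function names and identity placeholders are assumptions, not the paper's actual network, and they stand in for learned separation and dereverberation models.

```python
# Minimal sketch of a "separation before dereverberation" pipeline.
# Both stages are placeholders (identity transforms); in the real system
# each would be a learned neural module.

def separate(mixture, visual_cues):
    # Stage 1: extract the target speaker's (still reverberant) speech
    # from the mixture, guided by visual cues such as lip movements.
    # Placeholder: pass the signal through unchanged.
    return list(mixture)

def dereverberate(speech):
    # Stage 2: suppress reverberation in the already-separated speech.
    # Placeholder: pass the signal through unchanged.
    return list(speech)

def avse_pipeline(mixture, visual_cues):
    # The key design choice: run separation first, then dereverberation,
    # so the dereverberation stage sees a single-speaker signal.
    separated = separate(mixture, visual_cues)
    return dereverberate(separated)
```

Running dereverberation on a single-speaker signal, rather than on the raw mixture, is the motivation for this ordering: the second stage does not have to disentangle overlapping speakers and reverberation at the same time.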