Audio-visual target speaker extraction (AV-TSE) models rely primarily on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next-word prediction, and prior knowledge of the conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic priors. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including impaired visual cues, unseen languages, target speaker switches, an increased number of interfering speakers, and out-of-domain test sets. Demo page: https://alexwxwu.github.io/ELEGANCE/.