In this paper we propose a multi-modal multi-correlation learning framework for the task of audio-visual speech separation. Although extensive effort has been devoted to combining the audio and visual modalities, most previous work simply adopts a straightforward concatenation of audio and visual features. To exploit the truly useful information behind these two modalities, we define two key correlations: (1) identity correlation (between timbre and facial attributes) and (2) phonetic correlation (between phonemes and lip motion). Together, these two correlations comprise the complete information and show an advantage in separating the target speaker's voice, especially in hard cases such as same-gender speakers or similar content. For implementation, either contrastive learning or adversarial training is applied to maximize these two correlations. Both work well, while adversarial training shows its advantage by avoiding some limitations of contrastive learning. Compared with previous research, our solution demonstrates clear improvements on experimental metrics without additional complexity. Further analysis confirms the validity of the proposed architecture and its potential for future extension.
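As a rough illustration of the contrastive option mentioned above, the sketch below shows a standard symmetric InfoNCE-style loss between paired audio and visual embeddings. The embedding names (timbre_emb, face_emb, phoneme_emb, lip_emb), the temperature value, and the per-correlation usage are placeholders for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def infonce_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired audio and visual embeddings.

    audio_emb, visual_emb: (batch, dim) tensors; row i of each is assumed
    to come from the same speaker/utterance (a positive pair).
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Cross-entropy pulls matched audio-visual pairs together
    # and pushes mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage: one such loss per correlation, e.g.
# loss_identity = infonce_loss(timbre_emb, face_emb)   # identity correlation
# loss_phonetic = infonce_loss(phoneme_emb, lip_emb)   # phonetic correlation
```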