We present an approach to learning voice-face representations from talking-face videos without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation between voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed based on the natural correlation between audio clips and visual frames. However, this correlation may be weak or inaccurate in a large amount of real-world data, which introduces deviated positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which retains the advantages of contrastive methods while resisting the adverse effects of false negatives and deviated positives. On the one hand, CMPC learns intra-class invariance by constructing semantic-wise positives via unsupervised clustering in different modalities. On the other hand, by comparing the similarities of cross-modal instances with those of cross-modal prototypes, we dynamically recalibrate the contribution of unlearnable instances to the overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also achieves a significant improvement over previous instance-wise contrastive learning.
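The two ingredients named above, prototypes obtained by unsupervised clustering in each modality and a per-instance recalibration of the loss, can be illustrated with a minimal NumPy sketch. This is not the implementation from the paper: the spherical k-means routine, the temperature `tau`, the number of clusters `k`, the function names, and in particular the weighting rule `exp(-max(0, proto_sim - inst_sim))` are all assumptions chosen only to make the idea concrete.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def kmeans(x, k, iters=20, seed=0):
    """Tiny spherical k-means (assumed clustering step); x must be L2-normalized."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    labels = np.argmax(x @ centers.T, axis=1)
    return centers, labels


def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def prototype_contrastive_loss(voice, face, k=4, tau=0.1, seed=0):
    """Hypothetical cross-modal prototype loss with instance recalibration.

    voice[i] and face[i] are embeddings taken from the same video clip.
    """
    voice, face = l2_normalize(voice), l2_normalize(face)
    voice_protos, voice_labels = kmeans(voice, k, seed=seed)
    face_protos, face_labels = kmeans(face, k, seed=seed)
    idx = np.arange(len(voice))

    # Each voice is pulled toward the face prototype of its paired face's
    # cluster (and vice versa), instead of toward a single paired instance.
    v2f = -log_softmax(voice @ face_protos.T / tau)[idx, face_labels]
    f2v = -log_softmax(face @ voice_protos.T / tau)[idx, voice_labels]

    # Assumed recalibration rule: a pair whose instance similarity falls
    # below the similarity of its two prototypes is down-weighted.
    inst_sim = np.sum(voice * face, axis=1)
    proto_sim = np.sum(voice_protos[voice_labels] * face_protos[face_labels], axis=1)
    weights = np.exp(-np.maximum(0.0, proto_sim - inst_sim))

    loss = float(np.mean(weights * (v2f + f2v)))
    return loss, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    voice = rng.normal(size=(32, 8))  # stand-in voice embeddings
    face = rng.normal(size=(32, 8))   # stand-in face embeddings
    loss, weights = prototype_contrastive_loss(voice, face)
    print(loss, weights.min(), weights.max())
```

In a real training loop the embeddings would come from voice and face encoders, the clustering would be refreshed periodically, and the loss would be computed in a framework with automatic differentiation; the sketch only shows how prototype targets and per-instance weights combine into a single scalar.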