We revisit the weakly supervised cross-modal face-name alignment task; that is, given an image and a caption, we label the faces in the image with the names occurring in the caption. Whereas past approaches have learned the latent alignment between names and faces by uncertainty reasoning over a set of images and their respective captions, in this paper we rely on appropriate loss functions to learn the alignments in a neural network setting and propose SECLA and SECLA-B. SECLA is a Symmetry-Enhanced Contrastive Learning-based Alignment model that effectively maximizes the similarity scores between corresponding faces and names in a weakly supervised fashion. A variant of the model, SECLA-B, learns to align names and faces as humans do, that is, from easy to hard cases, to further improve on SECLA. More specifically, SECLA-B applies a two-stage learning framework: (1) training the model on an easy subset with only a few names and faces per image-caption pair; (2) leveraging the name-face pairs known from the easy cases via a bootstrapping strategy, with an additional loss that prevents forgetting while new alignments are learned. We achieve state-of-the-art results on both the augmented Labeled Faces in the Wild dataset and the Celebrity Together dataset. In addition, we believe our methods can be adapted to other multimodal news understanding tasks.
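To make the core idea concrete, below is a minimal sketch of a symmetric (two-direction) contrastive objective over face and name embeddings. This is an illustrative simplification, not SECLA's actual loss: it assumes fully supervised batch-level pairing (row i of the face embeddings matches row i of the name embeddings), whereas SECLA learns the correspondence weakly from image-caption pairs. The function name `symmetric_contrastive_loss` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def symmetric_contrastive_loss(face_emb, name_emb, temperature=0.1):
    """Illustrative symmetric contrastive loss (InfoNCE in both directions).

    Assumes row i of face_emb corresponds to row i of name_emb -- a
    supervised simplification; SECLA itself must infer this pairing.
    """
    # L2-normalize so dot products become cosine similarities.
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    n = name_emb / np.linalg.norm(name_emb, axis=1, keepdims=True)
    logits = f @ n.T / temperature          # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])     # diagonal entries are the matches

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(l.shape[0]), labels].mean()

    # Symmetry enhancement: average the face->name and name->face losses.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Averaging the two directional losses is what makes the objective symmetric: each face must pick out its name among all names in the batch, and each name must pick out its face among all faces.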