A strong image-text retrieval model depends on high-quality labeled data. While the builders of existing image-text retrieval datasets strive to ensure that each caption matches its linked image, they cannot prevent a caption from also fitting other images. We observe that such a many-to-many matching phenomenon is quite common in widely-used retrieval datasets, where one caption can describe up to 178 images. This large amount of missing matches not only confuses the model during training but also weakens evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions. Subsequently, we revise the image-text retrieval datasets by adding these entailed captions as additional weak labels of an image, and develop a universal variable learning rate strategy that teaches a retrieval model to distinguish the entailed captions from other negative samples. In experiments, we manually annotate an entailment-corrected image-text retrieval dataset for evaluation. The results demonstrate that the proposed entailment classifier achieves about 78% accuracy and consistently improves the performance of image-text retrieval baselines.
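To make the weak-label idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of one way entailed captions could be treated as weak positives in a contrastive image-text matching loss. The names `entailed_mask` and `weak_weight` are assumptions for illustration; the paper's actual strategy varies the learning rate on such samples, which this sketch approximates by down-weighting their loss contribution.

```python
# Illustrative sketch only: down-weight entailed (weakly positive) captions
# in a contrastive retrieval loss. Names and weighting scheme are assumptions.
import torch
import torch.nn.functional as F

def retrieval_loss(image_emb, text_emb, entailed_mask, weak_weight=0.2, temperature=0.07):
    """
    image_emb:     (B, D) L2-normalized image embeddings
    text_emb:      (B, D) L2-normalized caption embeddings
    entailed_mask: (B, B) bool, True where caption j is entailed by image i
                   (a weak label), in addition to the diagonal hard positives
    weak_weight:   reduced weight on weak positives, mimicking a smaller
                   effective learning rate on those pairs
    """
    logits = image_emb @ text_emb.t() / temperature               # (B, B) similarities
    hard_targets = torch.eye(logits.size(0), device=logits.device)

    # Soft target distribution: full weight on the annotated caption,
    # a reduced weight on entailed captions, renormalized per image.
    targets = hard_targets + weak_weight * entailed_mask.float() * (1 - hard_targets)
    targets = targets / targets.sum(dim=1, keepdim=True)

    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```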