Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to exploit additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting model capacity. Inspired by the observation that texts incorporate incomplete yet fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA significantly boosts performance on multiple downstream datasets at a small extra computational cost.
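To make the described idea concrete, below is a minimal sketch (not the authors' released code) of jointly optimizing a multi-label tag loss with a standard image-text contrastive loss: tags are parsed from each caption against a tag vocabulary, converted to a multi-hot target, and supervised with binary cross-entropy. The tag vocabulary, the tag-matching heuristic, and the loss weight `alpha` are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

# Assumed tag vocabulary; in practice tags would be mined from the corpus.
TAG_VOCAB = ["dog", "cat", "ball", "grass", "person"]
TAG_INDEX = {t: i for i, t in enumerate(TAG_VOCAB)}

def tags_to_multihot(caption: str) -> torch.Tensor:
    """Build a multi-hot target from vocabulary tags found in the caption."""
    target = torch.zeros(len(TAG_VOCAB))
    for token in caption.lower().split():
        if token in TAG_INDEX:
            target[TAG_INDEX[token]] = 1.0
    return target

def joint_loss(image_feat, text_feat, tag_logits, captions, alpha=0.1):
    """Image-text contrastive loss plus an auxiliary multi-label tag loss.

    image_feat, text_feat: L2-normalized features of shape (B, D).
    tag_logits: per-image tag predictions of shape (B, len(TAG_VOCAB)).
    """
    # Symmetric InfoNCE over the batch: matched pairs lie on the diagonal.
    logits = image_feat @ text_feat.t()
    labels = torch.arange(logits.size(0))
    itc = (F.cross_entropy(logits, labels) +
           F.cross_entropy(logits.t(), labels)) / 2
    # Multi-label recognition loss on tags extracted from the paired texts.
    targets = torch.stack([tags_to_multihot(c) for c in captions])
    mlr = F.binary_cross_entropy_with_logits(tag_logits, targets)
    return itc + alpha * mlr
```

Because the tag targets come directly from the paired texts, the auxiliary loss adds explicit per-tag supervision without an external object detector, which is consistent with the small extra computational cost claimed above.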