Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visual summaries. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and the number of images. In addition, unlike prior work, which assumed captions have a literal relation to the image, we assume images bear only a loose, illustrative correspondence to the text. To explore this problem, we introduce a large-scale multimodal dataset containing over 31M articles, 22M images, and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images. Finally, we introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.