In-context vision and language models like Flamingo support arbitrarily interleaved sequences of images and text as input. This format not only enables few-shot learning via interleaving independent supervised (image, text) examples, but also more complex prompts involving interaction between images, e.g., "What do image A and image B have in common?" To support this interface, pretraining occurs over web corpora that similarly contain interleaved images and text. To date, however, large-scale data of this form have not been publicly available. We release Multimodal C4 (mmc4), an augmentation of the popular text-only c4 corpus with images interleaved. We use a linear assignment algorithm to place images into longer bodies of text using CLIP features, a process that we show outperforms alternatives. mmc4 spans everyday topics like cooking, travel, technology, etc. A manual inspection of a random sample of documents shows that a vast majority (90%) of images are topically relevant, and that linear assignment frequently selects individual sentences that are specifically well-aligned with each image (78%). After filtering NSFW images, ads, etc., the corpus contains 103M documents with 585M images interleaved in 43B English tokens.
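To make the placement step concrete, below is a minimal sketch of image-to-sentence assignment via bipartite linear assignment over CLIP similarities. It assumes image and sentence embeddings for a document are already computed (e.g., with an off-the-shelf CLIP model); the function name and embedding shapes are illustrative assumptions, not the released mmc4 pipeline code.

```python
# Sketch: assign each image in a document to one sentence by maximizing
# total CLIP similarity with a linear (Hungarian) assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_images_to_sentences(image_embs: np.ndarray,
                               sentence_embs: np.ndarray):
    """Return (image_index, sentence_index) pairs maximizing total similarity.

    image_embs:    (n_images, d) L2-normalized CLIP image embeddings
    sentence_embs: (n_sentences, d) L2-normalized CLIP text embeddings
    """
    # Cosine similarity matrix; rows = images, columns = sentences.
    sim = image_embs @ sentence_embs.T
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy example: 3 images placed among 5 sentences of a document.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(3, 512))
sents = rng.normal(size=(5, 512))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
sents /= np.linalg.norm(sents, axis=1, keepdims=True)
print(assign_images_to_sentences(imgs, sents))
```

Because the cost matrix may be rectangular (typically fewer images than sentences), the assignment pairs each image with a distinct sentence while leaving the remaining sentences unmatched.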