Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.
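To make the unification concrete, below is a minimal, self-contained sketch of the core idea described above: text is rasterised into an image and passed through the same encoder as natural images, and the paired embeddings are aligned with a symmetric contrastive (InfoNCE) loss. This is not the paper's implementation; the helper names (`render_text`, `encode`, `contrastive_loss`) and the toy linear patch encoder are illustrative assumptions, whereas CLIPPO's actual encoder is a Vision Transformer.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

PATCH, DIM, RES = 16, 64, 224
rng = np.random.default_rng(0)
W_patch = rng.normal(0, 0.02, (3 * PATCH * PATCH, DIM))  # shared patch embedding
W_proj = rng.normal(0, 0.02, (DIM, DIM))                 # shared projection head

def render_text(text: str, res: int = RES) -> np.ndarray:
    """Rasterise a string onto a blank canvas so text enters as pixels."""
    img = Image.new("RGB", (res, res), "white")
    ImageDraw.Draw(img).text((8, 8), text, fill="black",
                             font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32) / 255.0

def encode(pixels: np.ndarray) -> np.ndarray:
    """One encoder for both modalities: patchify, embed, pool, project."""
    h = w = RES // PATCH
    patches = pixels.reshape(h, PATCH, w, PATCH, 3).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(h * w, -1)   # (num_patches, 3 * P * P)
    tokens = patches @ W_patch             # linear patch embedding
    z = tokens.mean(axis=0) @ W_proj       # pooled, projected representation
    return z / np.linalg.norm(z)           # unit-normalised embedding

def contrastive_loss(img_emb, txt_emb, temp=0.07):
    """Symmetric InfoNCE over a batch of aligned image/text embeddings."""
    logits = (img_emb @ txt_emb.T) / temp
    labels = np.arange(len(logits))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()  # matched pairs on the diagonal
    return 0.5 * (xent(logits) + xent(logits.T))

if __name__ == "__main__":
    # No tokenizer anywhere: any script the font can render works as-is.
    captions = ["a photo of a dog", "ein Foto einer Katze"]
    imgs = rng.random((len(captions), RES, RES, 3)).astype(np.float32)
    img_emb = np.stack([encode(x) for x in imgs])
    txt_emb = np.stack([encode(render_text(c)) for c in captions])
    print("contrastive loss:", contrastive_loss(img_emb, txt_emb))
```

Because both towers collapse into one set of weights, this sketch also illustrates the abstract's parameter claim: there is no separate text embedding table or text encoder to store, and multilingual input requires no tokenizer changes, only a font that covers the script.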