Large-scale multi-modal contrastive pre-training has demonstrated great utility in learning transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, this has employed separate encoders for each modality. However, recent work suggests that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of shared parameters along a spectrum. Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that lightweight modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13\% relative in zero-shot ImageNet classification (pre-trained on YFCC-100M), while simultaneously reducing the parameter count. In addition, our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks. Furthermore, we discover that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transfer of common semantic structure (e.g., attention patterns) from language to vision. Code is available at \href{https://github.com/Hxyou/MSCLIP}{URL}.
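To make the modality-shared idea concrete, below is a minimal PyTorch sketch, not the paper's actual implementation: it shares a transformer block's attention and MLP weights between the image and text streams, keeps small per-modality LayerNorms as a stand-in for the lightweight modality-specific parallel modules, and trains with the standard CLIP-style symmetric contrastive loss. All module names, and the choice of LayerNorms as the modality-specific component, are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTransformerBlock(nn.Module):
    """One transformer block whose attention and MLP weights are shared
    by both modalities; only the LayerNorms are modality-specific here
    (an assumed, simplified stand-in for the paper's parallel modules)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Tiny per-modality parameter sets; everything else is shared.
        self.norm1 = nn.ModuleDict(
            {m: nn.LayerNorm(dim) for m in ("image", "text")})
        self.norm2 = nn.ModuleDict(
            {m: nn.LayerNorm(dim) for m in ("image", "text")})

    def forward(self, x, modality):
        # Same weights process both streams, selected by the modality key:
        # block(img_tokens, "image") and block(txt_tokens, "text").
        h = self.norm1[modality](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2[modality](x))
        return x

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired
    image/text embeddings of shape (batch, dim)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
\end{verbatim}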