Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors, which are combined based on the factors they share. This principle has motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We show empirically, on two versions of multimodal MNIST and a multimodal brain imaging dataset, that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to downstream task performance, and (3) maximizing the similarity between representations has a regularizing effect on the neural network, which can sometimes reduce downstream performance while still revealing multimodal relations. Consequently, under a linear evaluation protocol we outperform previous unsupervised encoder-decoder methods based on CCA or the variational mixture model MMVAE across multiple datasets.
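The similarity objectives referred to above can be illustrated with a minimal InfoNCE-style contrastive loss between the embeddings produced by two modality encoders. This is a generic sketch, not the paper's implementation: the function name, the temperature value, and the use of NumPy are all illustrative assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss between two batches of
    embeddings (one row per sample; row i of z_a pairs with row i of z_b).
    Illustrative sketch only; real training would use an autodiff framework."""
    # L2-normalize each embedding so similarities are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by a temperature hyperparameter.
    logits = z_a @ z_b.T / temperature
    # Matching rows (the diagonal) are the positive pairs.
    labels = np.arange(len(z_a))

    def xent(l):
        # Cross-entropy of each row against its positive (diagonal) entry.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetrize: contrast modality A against B and B against A.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With correctly aligned pairs the loss is lower than with mismatched pairs, which is the signal that pulls the two modalities' representations together.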