We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity-preservation objectives. Unlike existing visual pre-training methods, which solve a proxy prediction task in a single domain, our method simultaneously exploits intrinsic data properties within each modality and semantic information from cross-modal correlation, hence improving the quality of learned visual representations. By including multimodal training in a unified framework with different types of contrastive losses, our method can learn more powerful and generic visual features. We first train our model on COCO and evaluate the learned visual representations on various downstream tasks, including image classification, object detection, and instance segmentation. For example, the visual representations pre-trained on COCO by our method achieve state-of-the-art top-1 validation accuracy of $55.3\%$ on ImageNet classification, under the common transfer protocol. We also evaluate our method on the large-scale Stock images dataset and show its effectiveness on multi-label image tagging and cross-modal retrieval tasks.
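To make the combination of intra- and inter-modal contrastive objectives concrete, the following is a minimal sketch in PyTorch. It is an illustration under assumed names, not the paper's exact formulation: the function names (`info_nce`, `multimodal_contrastive_loss`), the temperature value, the loss weights, and the symmetrization of the image-text term are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def info_nce(query, key, temperature=0.07):
    """Standard InfoNCE loss: matched (query_i, key_i) pairs are positives,
    all other keys in the batch serve as negatives.
    (Temperature value is an illustrative choice, not the paper's setting.)"""
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    logits = query @ key.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def multimodal_contrastive_loss(img_feat_a, img_feat_b, img_feat, txt_feat,
                                w_intra=1.0, w_inter=1.0):
    """Combine an intra-modal term (two augmented views of the same image)
    with an inter-modal term (an image paired with its caption).
    Weights w_intra / w_inter are hypothetical hyperparameters."""
    # Intra-modal: pull two augmented views of the same image together.
    loss_intra = info_nce(img_feat_a, img_feat_b)
    # Inter-modal: align images with their captions, symmetrized over both directions.
    loss_inter = 0.5 * (info_nce(img_feat, txt_feat) + info_nce(txt_feat, img_feat))
    return w_intra * loss_intra + w_inter * loss_inter
```

In this sketch, `img_feat_a` and `img_feat_b` would come from an image encoder applied to two augmentations of the same image, while `img_feat` and `txt_feat` would come from image and text encoders applied to paired image-caption data; only the visual encoder is kept for downstream transfer.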