This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice guiding I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks at a sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
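The multi-block masking recipe above can be sketched as a toy example. This is a minimal illustration, not the paper's implementation: the 14×14 patch grid, the number of target blocks, the aspect-ratio range, and the random stand-ins for the context encoder, EMA target encoder, and predictor are all assumptions made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = 14  # assumed: 14x14 patch grid (e.g. a 224px image with 16px patches)

def sample_block(scale_range, aspect=(0.75, 1.5)):
    """Sample a rectangular block of patches covering roughly the given
    fraction of the image, as a boolean mask over the patch grid."""
    scale = rng.uniform(*scale_range)
    ratio = rng.uniform(*aspect)
    area = scale * GRID * GRID
    h = min(GRID, max(1, int(round(np.sqrt(area / ratio)))))
    w = min(GRID, max(1, int(round(np.sqrt(area * ratio)))))
    top = rng.integers(0, GRID - h + 1)
    left = rng.integers(0, GRID - w + 1)
    mask = np.zeros((GRID, GRID), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

# (a) several target blocks, (b) each covering ~15-20% of the image
targets = [sample_block((0.15, 0.20)) for _ in range(4)]

# (c) a large, spatially distributed context block; target patches are
# removed from it so the context never leaks the targets to the predictor
context = sample_block((0.85, 1.0))
for t in targets:
    context &= ~t

# Random stand-ins for the encoders: in I-JEPA the loss is taken between
# predicted and actual *representations* of the target patches, not pixels.
D = 8
target_reprs = rng.normal(size=(GRID * GRID, D))            # EMA target encoder output
predicted = target_reprs + 0.1 * rng.normal(size=(GRID * GRID, D))  # predictor output
loss = float(np.mean([np.mean((predicted[t.ravel()] - target_reprs[t.ravel()]) ** 2)
                      for t in targets]))
```

The key property the sketch preserves is that prediction happens in representation space over masked regions, with the context mask guaranteed disjoint from every target block.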