The paper presents a scalable approach for learning spatially distributed visual representations over individual tokens and a holistic instance representation simultaneously. We use self-attention blocks to represent spatially distributed tokens, followed by cross-attention blocks to aggregate the holistic image instance. The core of the approach is the use of extremely large token masking (75\%-90\%) as the data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach where the instance representation from the unmasked subset is trained to predict that from the intact input. Instead of encouraging invariance across inputs, the model is required to capture informative variations in an image. The paper makes three contributions: 1) It presents random masking as a strong and computationally efficient data augmentation for siamese representation learning. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and improves performance with more data. 3) ExtreMA obtains stronger linear probing performance than masked modeling methods, and better transfer performance than prior contrastive models.
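To make the training setup concrete, the following is a minimal PyTorch sketch of one ExtreMA-style step as described above: extreme random token masking as the augmentation, a self-attention encoder with a cross-attention head that aggregates the holistic instance representation, and a BYOL-style prediction loss against a target encoder that sees the intact input. The module names (`Encoder`, `random_token_mask`, `byol_loss`), shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of an ExtreMA-style step:
# the online encoder sees only a small unmasked subset of patch tokens
# (75%-90% masked) and its instance representation is trained to predict
# the target encoder's representation of the intact input.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_token_mask(tokens, mask_ratio=0.8):
    """Keep a random (1 - mask_ratio) subset of tokens per sample."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))


class Encoder(nn.Module):
    """Self-attention over distributed tokens, then a learnable query
    cross-attends over them to aggregate a holistic instance feature."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        tokens = self.blocks(tokens)                    # token-level features
        q = self.query.expand(tokens.size(0), -1, -1)
        inst, _ = self.cross(q, tokens, tokens)         # holistic instance feature
        return inst.squeeze(1)


def byol_loss(pred, target):
    """Negative cosine similarity, as in BYOL (target is stop-gradient)."""
    return 2 - 2 * F.cosine_similarity(pred, target.detach(), dim=-1).mean()


# One illustrative training step on dummy patch tokens.
dim = 256
online, target = Encoder(dim), Encoder(dim)
target.load_state_dict(online.state_dict())
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

tokens = torch.randn(8, 196, dim)                      # e.g. 14x14 patch tokens
online_out = online(random_token_mask(tokens, 0.8))    # masked view
with torch.no_grad():
    target_out = target(tokens)                        # intact view
loss = byol_loss(predictor(online_out), target_out)
loss.backward()
# (EMA update of `target` from `online` omitted for brevity.)
```

A practical variant draws multiple masked views per image against the same intact-input target, which the abstract credits with greatly speeding up learning, since the masked forward passes are cheap relative to the intact one.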