The paper presents a scalable approach for simultaneously learning distributed representations over individual tokens and a holistic instance representation. We use self-attention blocks to represent distributed tokens, followed by cross-attention blocks to aggregate the holistic instance. The core of the approach is the use of extremely large token masking (75%-90%) as data augmentation for supervision. Our model, named ExtreMA, follows the plain BYOL approach, where the instance representation from the unmasked subset is trained to predict that from the intact input. Instead of encouraging invariance across views, learning requires the model to capture informative variations within an instance. The paper makes three contributions: 1) Random masking is a strong and computationally efficient data augmentation for learning generalizable attention representations. 2) With multiple sampling per instance, extreme masking greatly speeds up learning and hungers for more data. 3) Distributed representations can be learned from instance supervision alone, unlike the per-token supervision in masked modeling.
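To make the described pipeline concrete, below is a minimal PyTorch sketch of one training step under the stated setup: a self-attention token encoder, a cross-attention pooling module for the holistic instance, extreme random masking of the online branch, and a BYOL-style target branch on the intact input. The class names (TokenEncoder, CrossAttnPool, ExtreMASketch), the random_mask helper, and all hyperparameters are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch, assuming a ViT-style patch tokenization upstream.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenEncoder(nn.Module):
    """Self-attention blocks producing distributed token representations."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):          # (B, N, dim) -> (B, N, dim)
        return self.blocks(tokens)


class CrossAttnPool(nn.Module):
    """Cross-attention that aggregates tokens into one holistic instance vector."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):          # (B, N, dim) -> (B, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out.squeeze(1)


def random_mask(tokens, keep_ratio):
    """Keep a small random subset of tokens (75%-90% masking keeps 10%-25%)."""
    B, N, _ = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))


class ExtreMASketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.online_enc, self.online_pool = TokenEncoder(dim), CrossAttnPool(dim)
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Target branch: an EMA copy of the online branch, not updated by gradients.
        self.target_enc = copy.deepcopy(self.online_enc).requires_grad_(False)
        self.target_pool = copy.deepcopy(self.online_pool).requires_grad_(False)

    def forward(self, tokens, keep_ratio=0.15):
        # Online branch sees only the small unmasked subset of tokens.
        visible = random_mask(tokens, keep_ratio)
        online = self.predictor(self.online_pool(self.online_enc(visible)))
        # Target branch sees the intact input.
        with torch.no_grad():
            target = self.target_pool(self.target_enc(tokens))
        # BYOL-style loss: predict the intact-input instance representation.
        return 2 - 2 * F.cosine_similarity(online, target, dim=-1).mean()

    @torch.no_grad()
    def update_target(self, momentum=0.996):
        for o, t in zip(self.online_enc.parameters(), self.target_enc.parameters()):
            t.mul_(momentum).add_(o, alpha=1 - momentum)
        for o, t in zip(self.online_pool.parameters(), self.target_pool.parameters()):
            t.mul_(momentum).add_(o, alpha=1 - momentum)


# Usage: patch tokens from an image batch, two masked samples per instance.
tokens = torch.randn(8, 196, 256)                  # (batch, num_patches, dim)
model = ExtreMASketch()
loss = sum(model(tokens) for _ in range(2)) / 2    # multiple sampling per instance
loss.backward()
model.update_target()
```

Averaging losses over several independently masked views of the same instance is what "multiple sampling per instance" refers to above; each view is cheap because the online encoder processes only the visible 10%-25% of tokens.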