Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data-hungry. This has motivated research into self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels in order to link it to image properties, but instead focuses directly on extracting a concise representation of the image data that reflects the notion of similarity and is invariant to nuisance factors. The key vehicle for self-learning used by the majority of such methods is the generation of multiple views of the training data and the creation of pretext tasks that use these views to define notions of image similarity and data integrity. However, this approach lacks a natural propensity to extract contextual information. We propose group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by randomly manipulating groups of connected tokens, which consequently cover a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most existing SSL approaches, GMML neither requires a momentum encoder nor relies on careful implementation details such as large batches and gradient stopping, which are artefacts of many current self-supervised learning techniques. The source code is publicly available for the community to train on bigger corpora: https://github.com/Sara-Ahmed/GMML.
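The idea of manipulating groups of connected tokens can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it simply shows one plausible way, using NumPy, to build a mask over a patch-token grid by repeatedly covering random connected rectangular blocks until a target masking ratio is reached, and then to corrupt the masked tokens with noise so the visible tokens must be used to recover the hidden content. All function names, parameters, and the choice of rectangular blocks and Gaussian noise are illustrative assumptions.

```python
import numpy as np

def gmml_group_mask(grid_h, grid_w, mask_ratio=0.5, max_block=4, rng=None):
    """Sketch of GMML-style group masking: mark random connected
    rectangular blocks of patch tokens as masked until roughly
    `mask_ratio` of the (grid_h x grid_w) token grid is covered.
    Returns a boolean array, True where a token is masked.
    (Block shape and ratio are illustrative assumptions.)"""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        bh = int(rng.integers(1, max_block + 1))   # block height in tokens
        bw = int(rng.integers(1, max_block + 1))   # block width in tokens
        top = int(rng.integers(0, grid_h - bh + 1))
        left = int(rng.integers(0, grid_w - bw + 1))
        mask[top:top + bh, left:left + bw] = True  # cover a connected group
    return mask

def corrupt_tokens(tokens, mask, rng=None):
    """Replace masked patch embeddings with Gaussian noise, leaving the
    visible tokens intact. `tokens` has shape (num_tokens, dim)."""
    rng = np.random.default_rng() if rng is None else rng
    flat = mask.reshape(-1)
    corrupted = tokens.copy()
    corrupted[flat] = rng.normal(size=(int(flat.sum()), tokens.shape[1]))
    return corrupted
```

A pretraining step would then feed the corrupted token sequence through the transformer and penalise the reconstruction error on the masked positions only, so that recovering each hidden group forces the model to exploit the context provided by the visible parts of the same concept.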