Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the "best" layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.
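To make the notion of an Offset Local Attention Head concrete, the sketch below shows one heuristic way such heads could be identified from a model's attention maps: for each head, measure the average attention mass that every query patch places on the patch at each fixed grid offset, and look for heads where a single offset dominates. This is a minimal illustration under our own assumptions, not the paper's released implementation; the function name, grid size, and random test input are hypothetical.

```python
# Illustrative sketch: detecting offset-local behavior in ViT attention.
# Assumes a square grid of patch tokens (CLS token excluded) and an
# attention tensor of shape [num_heads, num_tokens, num_tokens], where
# each row sums to 1. All names here are our own, not from the paper.
import numpy as np

def offset_attention_scores(attn, grid):
    """For each head, compute the mean attention mass that every query
    patch places on the patch at each fixed (dy, dx) grid offset."""
    num_heads = attn.shape[0]
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]  # the 8 adjacent directions
    scores = {}
    for dy, dx in offsets:
        total = np.zeros(num_heads)
        count = 0
        for y in range(grid):
            for x in range(grid):
                ty, tx = y + dy, x + dx
                if 0 <= ty < grid and 0 <= tx < grid:
                    q = y * grid + x    # query token index
                    k = ty * grid + tx  # key token at the fixed offset
                    total += attn[:, q, k]
                    count += 1
        scores[(dy, dx)] = total / count
    return scores

# Example with random attention for 12 heads over a 14x14 patch grid
# (196 tokens); real ViT attention maps would replace this input.
attn = np.random.dirichlet(np.ones(196), size=(12, 196))
scores = offset_attention_scores(attn, grid=14)
for offset, per_head in scores.items():
    print(offset, np.round(per_head.max(), 3))  # strongest head per offset
```

Under this heuristic, a head whose mean score for one offset is far above the uniform baseline (1/196 here) and above its scores for all other offsets would be flagged as an offset local attention head, e.g. a head that consistently attends to the patch directly to its left.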