In this paper, we focus on analyzing and improving the dropout technique for the self-attention layers of Vision Transformers, which is important yet surprisingly overlooked by prior works. In particular, we investigate three core questions. First, what to drop in self-attention layers? Different from dropping attention weights as in the literature, we propose moving the dropout operation forward, ahead of the attention matrix calculation, and setting the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme preserves both the regularization and the probability features of attention weights, alleviating overfitting to specific patterns and enhancing the model's ability to globally capture vital information. Second, how to schedule the drop ratio across consecutive layers? In contrast to exploiting a constant drop ratio for all layers, we present a new decreasing schedule that gradually lowers the drop ratio along the stack of self-attention layers. We experimentally validate that the proposed schedule avoids both overfitting to low-level features and the loss of high-level semantics, thus improving the robustness and stability of model training. Third, is a structured dropout operation, as in CNNs, necessary? We try a patch-based block version of the dropout operation and find that this trick, useful for CNNs, is not essential for ViTs. Based on the exploration of the above three questions, we present the novel DropKey method, which regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, e.g., T2T and VOLO, as well as for various vision tasks, e.g., image classification, object detection, human-object interaction detection and human body shape recovery.
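To make the dropout-before-softmax scheme concrete, below is a minimal PyTorch sketch of a self-attention layer with DropKey. It is not the authors' official code: the class name `DropKeyAttention` and the exact masking mechanics (adding a large negative value to randomly selected attention logits before softmax) are our assumptions, chosen as one straightforward way to realize "drop Keys before the softmax" while keeping the surviving attention weights a valid probability distribution.

```python
import torch
import torch.nn as nn


class DropKeyAttention(nn.Module):
    """Multi-head self-attention with a dropout-before-softmax scheme.

    A minimal sketch (hypothetical, not the official implementation):
    instead of dropping attention weights after softmax, entries of the
    attention logits are randomly masked before softmax, which amounts to
    dropping Keys per query. Because softmax is applied afterwards, the
    remaining weights are automatically renormalized to sum to one.
    """

    def __init__(self, dim: int, num_heads: int = 8, drop_ratio: float = 0.1):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.drop_ratio = drop_ratio  # per-layer ratio; see the schedule note below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # logits, shape (B, heads, N, N)
        if self.training and self.drop_ratio > 0:
            # DropKey: mask logits BEFORE softmax with a large negative value,
            # so masked Keys receive near-zero attention after normalization.
            drop_mask = torch.bernoulli(torch.full_like(attn, self.drop_ratio))
            attn = attn + drop_mask * -1e12
        attn = attn.softmax(dim=-1)                    # weights still sum to 1
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For the decreasing schedule, one plausible instantiation (again an assumption, not the paper's exact formula) is a linear decay across depth, e.g. layer `i` of `L` layers uses `drop_ratio = base_ratio * (1 - i / (L - 1))`, so that early layers, which learn low-level features prone to overfitting, are regularized more heavily than late layers carrying high-level semantics.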