Dynamic Token Normalization Improves Vision Transformer

Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential ingredient in these models. However, we found that ordinary LN makes tokens at different positions similar in magnitude because it normalizes the embeddings within each token. As a result, it is difficult for Transformers with LN to capture inductive biases such as the positional context in an image. We tackle this problem by proposing a new normalizer, termed Dynamic Token Normalization (DTN), where normalization is performed both within each token (intra-token) and across different tokens (inter-token). DTN has several merits. Firstly, it is built on a unified formulation and thus can represent various existing normalization methods. Secondly, DTN learns to normalize tokens in both intra-token and inter-token manners, enabling Transformers to capture both the global contextual information and the local positional context. Thirdly, by simply replacing LN layers, DTN can be readily plugged into various vision transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird and Reformer. Extensive experiments show that transformers equipped with DTN consistently outperform their baseline models with minimal extra parameters and computational overhead. For example, DTN outperforms LN by $0.5\%$ - $1.2\%$ top-1 accuracy on ImageNet, by $1.2$ - $1.4$ box AP in object detection on the COCO benchmark, by $2.3\%$ - $3.9\%$ mCE in robustness experiments on ImageNet-C, and by $0.5\%$ - $0.8\%$ accuracy in Long ListOps on Long-Range Arena. Code will be made public at \url{https://github.com/wqshao126/DTN}.
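To make the intra-token versus inter-token distinction concrete, the following is a minimal sketch in plain Python. It is not the DTN formulation from the paper: it assumes a single scalar mixing weight `lam` (a hypothetical parameter introduced here for illustration) that blends per-token statistics, as in LN, with per-channel statistics computed across tokens. The function name `normalize_tokens` is likewise an assumption.

```python
import math


def normalize_tokens(x, lam=0.5, eps=1e-5):
    """Blend intra-token and inter-token statistics, then normalize.

    x   : list of T tokens, each a list of C floats.
    lam : hypothetical mixing weight in [0, 1]; lam=1 uses only intra-token
          (LN-style) statistics, lam=0 uses only inter-token statistics.
    """
    T, C = len(x), len(x[0])

    # Intra-token statistics: one mean/variance per token, over its C channels.
    mu_intra = [sum(tok) / C for tok in x]
    var_intra = [sum((v - mu_intra[t]) ** 2 for v in x[t]) / C for t in range(T)]

    # Inter-token statistics: one mean/variance per channel, over the T tokens.
    mu_inter = [sum(x[t][c] for t in range(T)) / T for c in range(C)]
    var_inter = [
        sum((x[t][c] - mu_inter[c]) ** 2 for t in range(T)) / T for c in range(C)
    ]

    out = []
    for t in range(T):
        row = []
        for c in range(C):
            # Simplifying assumption: a convex combination of the two sets of
            # statistics, which keeps the blended variance non-negative.
            mu = lam * mu_intra[t] + (1 - lam) * mu_inter[c]
            var = lam * var_intra[t] + (1 - lam) * var_inter[c]
            row.append((x[t][c] - mu) / math.sqrt(var + eps))
        out.append(row)
    return out


# Two tokens whose embeddings differ only in scale.
tokens = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ln_like = normalize_tokens(tokens, lam=1.0)     # intra-token only
inter_only = normalize_tokens(tokens, lam=0.0)  # inter-token only
```

With `lam=1.0` the two tokens map to nearly identical outputs, which illustrates the observation that LN makes tokens at different positions similar in magnitude. With `lam=0.0` the difference between the tokens is preserved in the normalized values.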
